2019-04-25 09:27:52

by Lech Perczak

Subject: Recurring warning in page_copy_sane (inside copy_page_to_iter) when running stress tests involving drop_caches

Hello,

Some time ago, after upgrading the kernel on our i.MX6Q-based boards to mainline 4.18, and now to the LTS 4.19 line, we started noticing strange warnings from the 'read' syscall during stress tests, whenever the page_copy_sane() check failed. Typical reproducibility is up to ~4 events per 24h. The warnings originate from different processes, mostly ones involved in the stress tests, but not necessarily ones touching the block devices we're stressing. Whenever the warning appeared in a process belonging to the block-device stress test, it was accompanied by corrupted data, since the read operation gets aborted.
Our reason for dropping caches is to ensure that the actual block device gets accessed during the test, without resorting to O_DIRECT.
With cache dropping disabled in the tests, the issue also disappeared, at least over a week-long run.
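
To make the test setup concrete, here is a minimal sketch of the
cache-dropping step our harness performs between read passes (the helper
is illustrative, not our actual test code; it is equivalent to
"sync; echo 3 > /proc/sys/vm/drop_caches"):

#include <fcntl.h>
#include <unistd.h>

static void drop_caches(void)
{
	int fd;

	sync();		/* flush dirty pages first */
	fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
	if (fd >= 0) {
		/* "3" drops both the pagecache and the reclaimable
		 * slab objects (dentries and inodes). */
		write(fd, "3", 1);
		close(fd);
	}
}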

Example processes caught in the last run (grepped from the stack traces):
[38128.430694] CPU: 2 PID: 32735 Comm: glmark2 Not tainted 4.19.32-devboardimx6q+g0a64e3717985 #1
[44168.081821] CPU: 2 PID: 21555 Comm: cat Tainted: G        W         4.19.32-devboardimx6q+g0a64e3717985 #1
[75879.424076] CPU: 3 PID: 1478 Comm: flashcp Tainted: G        W         4.19.32-devboardimx6q+g0a64e3717985 #1
[117060.951333] CPU: 1 PID: 14150 Comm: cat Tainted: G        W         4.19.32-devboardimx6q+g0a64e3717985 #1
[201668.438219] CPU: 1 PID: 14370 Comm: glmark2 Tainted: G        W         4.19.32-devboardimx6q+g0a64e3717985 #1
[214332.212960] CPU: 0 PID: 25633 Comm: cp Tainted: G        W         4.19.32-devboardimx6q+g0a64e3717985 #1
[283484.619058] CPU: 0 PID: 5091 Comm: glmark2 Tainted: G        W         4.19.32-devboardimx6q+g0a64e3717985 #1
[330256.709124] CPU: 2 PID: 4565 Comm: sensors Tainted: G        W         4.19.32-devboardimx6q+g0a64e3717985 #1
[337685.549556] CPU: 2 PID: 30132 Comm: sensors Tainted: G        W         4.19.32-devboardimx6q+g0a64e3717985 #1

When I started debugging the issue, I noticed that in all cases we're dealing with highmem zero-order pages. In this case compound_head(page) == page, so page_address(page) should be equal to page_address(head).
However, this is not the case: page_address(head) returns zero every time, causing the value of "v" to explode and the check to fail.
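
To make the failure concrete, here is the check as it stands in 4.19
(the same code appears in the '-' lines of the debug patch at the end of
this message), with the values from the warning below plugged in:

	struct page *head = compound_head(page);
	size_t v = n + offset + page_address(page) - page_address(head);

	if (likely(n <= v && v <= (PAGE_SIZE << compound_order(head))))
		return true;

	/* With n=4096, offset=0, page_address(page)=2146996224 and
	 * page_address(head) spuriously returning 0 (NULL):
	 *   v = 4096 + 0 + 2146996224 - 0 = 2147000320,
	 * while the upper bound for an order-0 page is
	 * PAGE_SIZE << 0 = 4096, so the check fails although the page
	 * itself is perfectly fine. */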

At first I thought that it could be caused by a modification of struct page itself, but that turned out not to be the case: sometimes struct page would remain unmodified for the duration of the check. With lowmem pages this also couldn't happen, because there page_address() is derived directly from the struct page pointer itself.

Finally, after gathering a few traces with added debug logs, I decided to trigger a panic on the warning and capture a vmcore.
Analyzing the vmcore, I confirmed that the struct page_address_slot corresponding to the struct page had an empty list of mappings, explaining the NULL returned by page_address(head) in page_copy_sane().
Yeah, I had to calculate the hash of the struct page pointer by hand ;)
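
For reference, the slot I had to locate by hand is found the same way
mm/highmem.c finds it (paraphrased from memory; check your tree for the
exact code):

	/* page_address() for a highmem page hashes the struct page
	 * pointer into page_address_htable and walks that slot's list
	 * of struct page_address_map entries. */
	static struct page_address_slot *page_slot(const struct page *page)
	{
		return &page_address_htable[hash_ptr(page, PA_HASH_ORDER)];
	}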

In the meantime I noticed a patch by Eric Dumazet ("iov_iter: optimize page_copy_sane()") [1]. Applying this patch silenced the warnings, at least for a test run going on for over a week. This is quite expected: with this change 'page' isn't dereferenced at all, nor is page_address() called. I suspect, though, that applying this patch may only hide the real issue I'm facing, and that it might be missing the case of highmem pages as well.
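
For reference, the patched check (paraphrased from [1]; see the commit
for the exact code) short-circuits on requests that fit a single page,
and otherwise derives the offset from the struct page pointers alone,
never calling page_address():

static bool page_copy_sane(struct page *page, size_t offset, size_t n)
{
	struct page *head;
	size_t v = n + offset;

	/* Fast path: the request fits in one page; no need to touch
	 * struct page at all. */
	if (n <= v && v <= PAGE_SIZE)
		return true;

	/* Slow path: compute the byte offset of 'page' within its
	 * compound page from the pointers themselves. */
	head = compound_head(page);
	v += (size_t)(page - head) << PAGE_SHIFT;

	if (likely(n <= v && v <= (PAGE_SIZE << compound_order(head))))
		return true;
	WARN_ON(1);
	return false;
}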

An example stack trace I captured, with added debug logs and a copy of struct page taken before and after the check in page_copy_sane():

This last one shows no change to struct page across the call to page_copy_sane(): the first call to page_address(page) returns a valid page address (page_address=2146996224), while the second returns 0 (head_address=0). Subsequent calls to page_address() inside the WARN() argument list also return 0. The same was true for the occurrence I captured the vmcore for.

(For your reference, a patch containing my debug prints is at the end of the message)

[337685.344204] ------------[ cut here ]------------
[337685.356870] WARNING: CPU: 0 PID: 30132 at /mnt/work/git/platforms-manifest/build_devboardimx6q/tmp/work-shared/devboardimx6q/kernel-source/lib/iov_iter.c:834 page_copy_sane+0x13c/0x178
[337685.420706] page_copy_sane: page=7721123e, offset=0, n=4096, v=2147000320, head=7721123e, page_address=2146996224, head_address=0, compound_order(head)=0, page_address(page)=0, page_address(head)=0 page_address(page_copy)=0, flags(page)=referenced|uptodate|lru|active|arch_1, flags(page_copy)=referenced|uptodate|lru|active|arch_1
[337685.494736] Modules linked in: xt_nat iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_tables x_tables usb_f_mass_storage usb_f_rndis u_ether ci_hdrc_imx ci_hdrc usbmisc_imx ulpi libcomposite configfs udc_core
[337685.549556] CPU: 2 PID: 30132 Comm: sensors Tainted: G        W         4.19.32-devboardimx6q+g0a64e3717985 #1
[337685.558314] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[337685.563600] [<8010fe44>] (unwind_backtrace) from [<8010b840>] (show_stack+0x10/0x14)
[337685.570081] [<8010b840>] (show_stack) from [<807a5958>] (dump_stack+0x88/0x9c)
[337685.576033] [<807a5958>] (dump_stack) from [<8011f500>] (__warn+0xfc/0x114)
[337685.581721] [<8011f500>] (__warn) from [<8011f560>] (warn_slowpath_fmt+0x48/0x6c)
[337685.587936] [<8011f560>] (warn_slowpath_fmt) from [<803a0664>] (page_copy_sane+0x13c/0x178)
[337685.595004] [<803a0664>] (page_copy_sane) from [<803a2fe4>] (copy_page_to_iter+0x18/0x474)
[337685.602020] [<803a2fe4>] (copy_page_to_iter) from [<801bd9d4>] (generic_file_read_iter+0x2cc/0x980)
[337685.609794] [<801bd9d4>] (generic_file_read_iter) from [<8020949c>] (__vfs_read+0xf8/0x158)
[337685.616866] [<8020949c>] (__vfs_read) from [<80209588>] (vfs_read+0x8c/0x118)
[337685.622717] [<80209588>] (vfs_read) from [<80209aac>] (ksys_read+0x4c/0xac)
[337685.628396] [<80209aac>] (ksys_read) from [<80101000>] (ret_fast_syscall+0x0/0x54)
[337685.634668] Exception stack(0xa4e79fa8 to 0xa4e79ff0)
[337685.638427] 9fa0:                   00000008 00000000 00000003 7ed03208 00000200 00000000
[337685.645318] 9fc0: 00000008 00000000 76fe8958 00000003 00000000 7ed03208 76fe5640 7ed031b4
[337685.652205] 9fe0: 76fe7cf0 7ed0316c 76fbeee0 76fd1b1c
[337685.717080] ---[ end trace 660d072e57b3d168 ]---
[337685.720698] page_copy 00000000: 6c 04 00 40 84 1f f4 ef dc aa c4 80 5c e6 9b ed  l..@........\...
[337685.720713] page_copy 00000010: 00 00 00 00 00 00 00 00 ff ff ff ff 02 00 00 00  ................
[337685.720721] *page     00000000: 6c 04 00 40 84 1f f4 ef dc aa c4 80 5c e6 9b ed  l..@........\...
[337685.720728] *page     00000010: 00 00 00 00 00 00 00 00 ff ff ff ff 02 00 00 00  ................

The scenario of dropping caches via /proc/sys/vm/drop_caches seems to be rarely used (and tested); however, I suspect that while caches are being dropped, some pages might be unmapped too early, causing other hard-to-notice problems.

As my findings are quite worrisome, I'd be glad to hear whether my analysis is valid, and whether the optimization made by Eric can be considered 'the fix', or whether that is too optimistic.
Also, as I'd like to investigate the issue further, I'd be glad for any hints on where to continue.

[1] https://lkml.org/lkml/2019/2/26/686, merged as 6daef95b8c91 in master

Patch containing debug logs:

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 8be175df3075..ded7e073434c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -815,14 +815,25 @@ bool _copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(_copy_from_iter_full_nocache);
 
-static inline bool page_copy_sane(struct page *page, size_t offset, size_t n)
+static noinline bool page_copy_sane(struct page *page, size_t offset, size_t n)
 {
+       struct page page_copy = *page;
        struct page *head = compound_head(page);
-       size_t v = n + offset + page_address(page) - page_address(head);
+       size_t page_addr = (size_t) page_address(page);
+       size_t head_addr = (size_t) page_address(head);
+       size_t v = n + offset + page_addr - head_addr;
+       unsigned int order = compound_order(head);
 
-       if (likely(n <= v && v <= (PAGE_SIZE << compound_order(head))))
+       if (likely(n <= v && v <= (PAGE_SIZE << order)))
                return true;
-       WARN_ON(1);
+       WARN(1, "%s: page=%p, offset=%zu, n=%zu, v=%zu, head=%p, page_address=%zu,"
+               " head_address=%zu, compound_order(head)=%u, page_address(page)=%zu,"
+               " page_address(head)=%zu page_address(page_copy)=%zu, flags(page)=%pGp, flags(page_copy)=%pGp\n",
+            __func__, page, offset, n, v, head, page_addr, head_addr, order,
+           (size_t) page_address(page), (size_t) page_address(head),
+           (size_t) page_address(&page_copy), &page->flags, &page_copy.flags);
+       print_hex_dump_bytes("page_copy ", DUMP_PREFIX_OFFSET, &page_copy, sizeof(page_copy));
+       print_hex_dump_bytes("*page     ", DUMP_PREFIX_OFFSET, page, sizeof(*page));
        return false;
 }

--
Pozdrawiam/With kind regards,
Lech Perczak

Sr. Software Engineer
Camlin Technologies Poland Limited Sp. z o.o.


2019-04-26 08:44:27

by Lech Perczak

Subject: Re: Recurring warning in page_copy_sane (inside copy_page_to_iter) when running stress tests involving drop_caches

Cc [email protected]

On 25.04.2019 at 11:25, Lech Perczak wrote:
> [...]
--
Pozdrawiam/With kind regards,
Lech Perczak

Sr. Software Engineer
Camlin Technologies Poland Limited Sp. z o.o.

2019-05-15 13:20:43

by Lech Perczak

Subject: Re: Recurring warning in page_copy_sane (inside copy_page_to_iter) when running stress tests involving drop_caches

Hi,

On 26.04.2019 at 10:42, Lech Perczak wrote:
> Cc [email protected]
>
> On 25.04.2019 at 11:25, Lech Perczak wrote:
>> [...]
As my message went out during the Linux MM & FS summit, it might have gone unnoticed, so just a kind reminder :)
Maybe someone has had a chance to look at it?

--
Pozdrawiam/With kind regards,
Lech Perczak

Sr. Software Engineer
Camlin Technologies Poland Limited Sp. z o.o.

2019-05-15 14:45:35

by Matthew Wilcox

Subject: Re: Recurring warning in page_copy_sane (inside copy_page_to_iter) when running stress tests involving drop_caches

> > On 25.04.2019 at 11:25, Lech Perczak wrote:
> >> [...]
> >>
> >> When I started debugging the issue, I noticed that in all cases we're dealing with highmem zero-order pages. In this case compound_head(page) == page, so page_address(page) should be equal to page_address(head).
> >> However, this is not the case: page_address(head) returns zero every time, causing the value of "v" to explode and the check to fail.

You're seeing a race between page_address(page) being called twice.
Between those two calls, something has caused the page to be removed from
the page_address_map list. Eric's patch avoids calling page_address(),
so apply it and be happy.

Greg, can you consider 6daef95b8c914866a46247232a048447fff97279 for
backporting to stable? Nobody realised it was a bugfix at the time it
went in. I suspect there aren't too many of us running HIGHMEM kernels
any more.

2019-05-15 15:04:22

by Eric Dumazet

Subject: Re: Recurring warning in page_copy_sane (inside copy_page_to_iter) when running stress tests involving drop_caches

On Wed, May 15, 2019 at 7:43 AM Matthew Wilcox <[email protected]> wrote:
>
> > > On 25.04.2019 at 11:25, Lech Perczak wrote:
> > >> [...]
>
> You're seeing a race between page_address(page) being called twice.
> Between those two calls, something has caused the page to be removed from
> the page_address_map list. Eric's patch avoids calling page_address(),
> so apply it and be happy.

Hmm... won't the kmap_atomic() done later, after page_copy_sane(),
suffer from the same race?

It seems there is a real bug somewhere to fix.


2019-05-15 15:06:07

by Greg Kroah-Hartman

Subject: Re: Recurring warning in page_copy_sane (inside copy_page_to_iter) when running stress tests involving drop_caches

On Wed, May 15, 2019 at 07:43:52AM -0700, Matthew Wilcox wrote:
> > > On 25.04.2019 at 11:25, Lech Perczak wrote:
> > >> [...]
>
> You're seeing a race between page_address(page) being called twice.
> Between those two calls, something has caused the page to be removed from
> the page_address_map list. Eric's patch avoids calling page_address(),
> so apply it and be happy.
>
> Greg, can you consider 6daef95b8c914866a46247232a048447fff97279 for
> backporting to stable? Nobody realised it was a bugfix at the time it
> went in. I suspect there aren't too many of us running HIGHMEM kernels
> any more.
>

Sure, what kernel version(s) should this go to? 4.19 and newer?

thanks,

greg k-h

2019-05-15 15:19:41

by Matthew Wilcox

Subject: Re: Recurring warning in page_copy_sane (inside copy_page_to_iter) when running stress tests involving drop_caches

On Wed, May 15, 2019 at 08:02:17AM -0700, Eric Dumazet wrote:
> On Wed, May 15, 2019 at 7:43 AM Matthew Wilcox <[email protected]> wrote:
> > You're seeing a race between page_address(page) being called twice.
> > Between those two calls, something has caused the page to be removed from
> > the page_address_map list. Eric's patch avoids calling page_address(),
> > so apply it and be happy.
>
> Hmm... won't the kmap_atomic() done later, after page_copy_sane(),
> suffer from the same race?
>
> It seems there is a real bug somewhere to fix.

No. page_address() called before the kmap_atomic() will look through
the list of mappings and see if that page is mapped somewhere. We unmap
lazily, so all it takes to trigger this race is that the page _has_
been mapped before, and its mapping gets torn down during this call.

While the page is kmapped, its mapping cannot be torn down.
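
To spell it out, the lookup in question looks roughly like this
(paraphrased from mm/highmem.c, not the exact code):

void *page_address(const struct page *page)
{
	unsigned long flags;
	void *ret = NULL;
	struct page_address_slot *pas;

	if (!PageHighMem(page))
		return lowmem_page_address(page);

	/* Walk this slot's list of live kmap mappings.  A lazy pkmap
	 * flush can empty it at any time for a page that is not
	 * currently kmapped; that is exactly the race above. */
	pas = page_slot(page);
	spin_lock_irqsave(&pas->lock, flags);
	if (!list_empty(&pas->lh)) {
		struct page_address_map *pam;

		list_for_each_entry(pam, &pas->lh, list) {
			if (pam->page == page) {
				ret = pam->virtual;
				break;
			}
		}
	}
	spin_unlock_irqrestore(&pas->lock, flags);
	return ret;
}

kmap_atomic() doesn't depend on this table staying populated: it sets up
its own mapping (on ARM it first tries to pin an existing kmap under the
kmap lock), so it doesn't suffer from the race.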

2019-05-15 15:23:46

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Recurring warning in page_copy_sane (inside copy_page_to_iter) when running stress tests involving drop_caches

On Wed, May 15, 2019 at 05:04:06PM +0200, Greg Kroah-Hartman wrote:
> > Greg, can you consider 6daef95b8c914866a46247232a048447fff97279 for
> > backporting to stable? Nobody realised it was a bugfix at the time it
> > went in. I suspect there aren't too many of us running HIGHMEM kernels
> > any more.
> >
>
> Sure, what kernel version(s) should this go to? 4.19 and newer?

Looks like the problem was introduced with commit
a90bcb86ae700c12432446c4aa1819e7b8e172ec so 4.14 and newer, I think.

2019-05-15 15:39:06

by Greg Kroah-Hartman

Subject: Re: Recurring warning in page_copy_sane (inside copy_page_to_iter) when running stress tests involving drop_caches

On Wed, May 15, 2019 at 08:20:35AM -0700, Matthew Wilcox wrote:
> On Wed, May 15, 2019 at 05:04:06PM +0200, Greg Kroah-Hartman wrote:
> > > Greg, can you consider 6daef95b8c914866a46247232a048447fff97279 for
> > > backporting to stable? Nobody realised it was a bugfix at the time it
> > > went in. I suspect there aren't too many of us running HIGHMEM kernels
> > > any more.
> > >
> >
> > Sure, what kernel version(s) should this go to? 4.19 and newer?
>
> Looks like the problem was introduced with commit
> a90bcb86ae700c12432446c4aa1819e7b8e172ec so 4.14 and newer, I think.

Ok, I'll look into it after this round of stable kernels are released,
thanks.

greg k-h

2019-05-16 15:21:25

by Lech Perczak

Subject: Re: Recurring warning in page_copy_sane (inside copy_page_to_iter) when running stress tests involving drop_caches


On 15.05.2019 at 17:18, Matthew Wilcox wrote:
> On Wed, May 15, 2019 at 08:02:17AM -0700, Eric Dumazet wrote:
>> On Wed, May 15, 2019 at 7:43 AM Matthew Wilcox <[email protected]> wrote:
>>> You're seeing a race between page_address(page) being called twice.
>>> Between those two calls, something has caused the page to be removed from
>>> the page_address_map list. Eric's patch avoids calling page_address(),
>>> so apply it and be happy.
>> Hmm... won't the kmap_atomic() done later, after page_copy_sane(),
>> suffer from the same race?
>>
>> It seems there is a real bug somewhere to fix.
> No. page_address() called before the kmap_atomic() will look through
> the list of mappings and see if that page is mapped somewhere. We unmap
> lazily, so all it takes to trigger this race is that the page _has_
> been mapped before, and its mapping gets torn down during this call.
>
> While the page is kmapped, its mapping cannot be torn down.

And that's the answer I'm really glad to hear.
In the meantime I've set up a test run with CONFIG_HIGHMEM disabled to
be extra sure; quite expectedly, though, this robs the system of 256MiB
of accessible memory.

Unfortunately, looking through the defconfigs for 32-bit ARM, CONFIG_HIGHMEM
is enabled in quite a few of them, i.MX included.

I'll pick up the patch then, and drop it once it lands in 4.19.y.
Thanks!

--
With kind regards,
Lech

2019-05-20 11:46:40

by Greg Kroah-Hartman

Subject: Re: Recurring warning in page_copy_sane (inside copy_page_to_iter) when running stress tests involving drop_caches

On Wed, May 15, 2019 at 08:20:35AM -0700, Matthew Wilcox wrote:
> On Wed, May 15, 2019 at 05:04:06PM +0200, Greg Kroah-Hartman wrote:
> > > Greg, can you consider 6daef95b8c914866a46247232a048447fff97279 for
> > > backporting to stable? Nobody realised it was a bugfix at the time it
> > > went in. I suspect there aren't too many of us running HIGHMEM kernels
> > > any more.
> > >
> >
> > Sure, what kernel version(s) should this go to? 4.19 and newer?
>
> Looks like the problem was introduced with commit
> a90bcb86ae700c12432446c4aa1819e7b8e172ec so 4.14 and newer, I think.

Thanks, now queued up.

greg k-h