2015-07-18 18:04:24

by Dixit, Ashutosh

Subject: Regression in v4.2-rc1: vmalloc_to_page with ioremap

vmalloc_to_page with ioremap'd memory used to work correctly till v4.1.
In v4.2-rc1, when ioremap is done using huge pages, vmalloc_to_page on
ioremap'd memory crashes. Are there plans to fix this?

An example of the use of vmalloc_to_page with ioremap is in the Intel
MIC SCIF driver (drivers/misc/mic/scif/scif_nodeqp.c), which DMA maps
the ioremap'd range of one device to a second device.
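
The pattern is roughly the following (a simplified sketch, not the
exact scif_nodeqp.c code; bar_va, bar_len and peer_pdev are just
placeholders for what the driver keeps around):

struct scatterlist *sg;
void *bar_va;			/* ioremap'd BAR of the peer MIC device */
int i, nents = bar_len >> PAGE_SHIFT;

sg = kmalloc_array(nents, sizeof(*sg), GFP_KERNEL);
sg_init_table(sg, nents);
for (i = 0; i < nents; i++)
	sg_set_page(&sg[i], vmalloc_to_page(bar_va + i * PAGE_SIZE),
		    PAGE_SIZE, 0);
/* map device A's BAR pages for DMA access by device B */
dma_map_sg(&peer_pdev->dev, sg, nents, DMA_BIDIRECTIONAL);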

Thanks,
Ashutosh


2015-07-20 15:56:04

by Toshi Kani

Subject: Re: Regression in v4.2-rc1: vmalloc_to_page with ioremap

On Sat, 2015-07-18 at 18:04 +0000, Dixit, Ashutosh wrote:
> vmalloc_to_page with ioremap'd memory used to work correctly till
> v4.1. In v4.2-rc1, when ioremap is done using huge pages,
> vmalloc_to_page on ioremap'd memory crashes. Are there plans to fix
> this?
>
> An example of the use of vmalloc_to_page with ioremap is in the Intel
> MIC SCIF driver (drivers/misc/mic/scif/scif_nodeqp.c), which DMA maps
> the ioremap'd range of one device to a second device.

Yes, vmalloc_to_page() assumes 4KB mappings. But ioremap with huge
pages was already enabled in v4.1, while v4.2-rc1 updated the MTRR
checks in the kernel mapping helpers. Can you send me outputs of the
following files? If the driver fails to load in v4.2-rc1, you can
obtain the info in v4.1.

/proc/mtrr
/proc/iomem
/proc/vmallocinfo
/sys/kernel/debug/kernel_page_tables (need CONFIG_X86_PTDUMP set)

Also, does the driver map a regular memory range with ioremap? If not,
how does 'struct page' get allocated for the range (since
vmalloc_to_page returns a page pointer)?

Thanks,
-Toshi

2015-07-20 16:00:11

by Toshi Kani

Subject: Re: Regression in v4.2-rc1: vmalloc_to_page with ioremap

On Mon, 2015-07-20 at 09:54 -0600, Toshi Kani wrote:
> On Sat, 2015-07-18 at 18:04 +0000, Dixit, Ashutosh wrote:
> > vmalloc_to_page with ioremap'd memory used to work correctly till
> > v4.1. In v4.2-rc1, when ioremap is done using huge pages,
> > vmalloc_to_page on ioremap'd memory crashes. Are there plans to fix
> > this?
> >
> > An example of the use of vmalloc_to_page with ioremap is in the
> > Intel MIC SCIF driver (drivers/misc/mic/scif/scif_nodeqp.c), which
> > DMA maps the ioremap'd range of one device to a second device.
>
> Yes, vmalloc_to_page() assumes 4KB mappings. But ioremap with huge
> pages was already enabled in v4.1, while v4.2-rc1 updated the MTRR
> checks in the kernel mapping helpers. Can you send me outputs of the
> following files? If the driver fails to load in v4.2-rc1, you can
> obtain the info in v4.1.
>
> /proc/mtrr
> /proc/iomem
> /proc/vmallocinfo
> /sys/kernel/debug/kernel_page_tables (need CONFIG_X86_PTDUMP set)
>
> Also, does the driver map a regular memory range with ioremap? If
> not, how does 'struct page' get allocated for the range (since
> vmalloc_to_page returns a page pointer)?

Can you also try the 'nohugeiomap' kernel option to the 4.2-rc1 kernel
to see if the problem goes away?

Thanks,
-Toshi

2015-07-20 18:33:07

by Dixit, Ashutosh

Subject: Re: Regression in v4.2-rc1: vmalloc_to_page with ioremap

On Mon, Jul 20 2015 at 08:59:02 AM, Toshi Kani <[email protected]> wrote:

> Can you also try the 'nohugeiomap' kernel option to the 4.2-rc1 kernel
> to see if the problem goes away?

Yes the problem goes away with 'nohugeiomap'.

> Yes, vmalloc_to_page() assumes 4KB mappings. But ioremap with huge
> pages was already enabled in v4.1, while v4.2-rc1 updated the MTRR
> checks in the kernel mapping helpers.

Yes, we had bisected the failure down to the following commit, and the
problem goes away if this commit is reverted. We think that without
this commit ioremap maps device memory with 4K pages, which is why we
don't see the issue.

commit b73522e0c1be58d3c69b124985b8ccf94e3677f7
Author: Toshi Kani <[email protected]>
Date: Tue May 26 10:28:10 2015 +0200

x86/mm/mtrr: Enhance MTRR checks in kernel mapping helpers

> Can you send me outputs of the following files? If the driver
> fails to load in v4.2-rc1, you can obtain the info in v4.1.
>
> /proc/mtrr
> /proc/iomem
> /proc/vmallocinfo
> /sys/kernel/debug/kernel_page_tables (need CONFIG_X86_PTDUMP set)

Since the outputs are large I have sent you the outputs in a separate mail
outside the mailing list.

> Also, does the driver map a regular memory range with ioremap? If not,
> how does 'struct page' get allocated for the range (since
> vmalloc_to_page returns a page pointer)?

No, the driver does not map regular memory with ioremap, only device
memory. vmalloc_to_page was returning a valid 'struct page' in this
case too. It appears it can do this correctly using pte_page as long as
all four page table levels (pgd, pud, pmd, pte) are present, and the
problem seems to be happening because in the case of huge pages they
are not. For us the BAR size is 8 GB, so we think that with the new
code ioremap maps the BARs using 1 GB pages.

Here is the call stack for the crash:

[47360.050724] BUG: unable to handle kernel paging request at ffffc47e00000000
[47360.050965] IP: [<ffffffff8119da5c>] vmalloc_to_page+0x6c/0xb0
[47360.051136] PGD 0
[47360.051288] Oops: 0000 [#1] SMP
[47360.059112] Workqueue: SCIF INTR 2 scif_intr_bh_handler [scif]
[47360.059481] task: ffff88042d659d80 ti: ffff88042d664000 task.ti: ffff88042d664000
[47360.059986] RIP: 0010:[<ffffffff8119da5c>] [<ffffffff8119da5c>] vmalloc_to_page+0x6c/0xb0
[47360.060568] RSP: 0018:ffff88042d667bb8 EFLAGS: 00010206
[47360.060863] RAX: 00003c7e00000000 RBX: ffffc90040000000 RCX: 00003ffffffff000
[47360.061165] RDX: 00003c7e00000000 RSI: ffff880000000000 RDI: ffffc90040000000
[47360.061466] RBP: ffff88042d667bb8 R08: 0000000000000000 R09: ffff880679100000
[47360.061766] R10: 0000000000000000 R11: 0000000000000008 R12: 0000000000040000
[47360.062066] R13: 0000000000008000 R14: ffff880679100000 R15: ffff880679100000
[47360.062367] FS: 0000000000000000(0000) GS:ffff88083f600000(0000) knlGS:0000000000000000
[47360.062873] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[47360.063169] CR2: ffffc47e00000000 CR3: 0000000001c0e000 CR4: 00000000000407e0
[47360.063468] Stack:
[47360.063754] ffff88042d667c08 ffffffffa054c141 0000000000000000 0000000000040000
[47360.064565] ffff88042d667be8 ffff88082a6ebd20 0000000000000020 0000000000008000
[47360.065389] ffff88067f140228 ffff88082993a000 ffff88042d667c88 ffffffffa054d068
[47360.066212] Call Trace:
[47360.066530] [<ffffffffa054c141>] scif_p2p_setsg+0xb1/0xd0 [scif]
[47360.066835] [<ffffffffa054d068>] scif_init_p2p_info+0xe8/0x290 [scif]
[47360.067171] [<ffffffffa054d37a>] scif_init+0x16a/0x2c0 [scif]
[47360.067487] [<ffffffff810a9dea>] ? pick_next_task_fair+0x42a/0x4f0
[47360.067817] [<ffffffffa054c242>] scif_nodeqp_msg_handler+0x42/0x80 [scif]
[47360.068122] [<ffffffffa054c322>] scif_nodeqp_intrhandler+0x32/0x70 [scif]
[47360.068453] [<ffffffffa0547575>] scif_intr_bh_handler+0x25/0x40 [scif]
[47360.068759] [<ffffffff8108b553>] process_one_work+0x143/0x410
[47360.069087] [<ffffffff8108b93b>] worker_thread+0x11b/0x4b0
[47360.069387] [<ffffffff816c4659>] ? __schedule+0x309/0x880
[47360.069713] [<ffffffff8108b820>] ? process_one_work+0x410/0x410
[47360.070013] [<ffffffff8108b820>] ? process_one_work+0x410/0x410
[47360.070334] [<ffffffff8109094c>] kthread+0xcc/0xf0
[47360.070639] [<ffffffff8109b9ae>] ? schedule_tail+0x1e/0xc0
[47360.070961] [<ffffffff81090880>] ? kthread_freezable_should_stop+0x70/0x70
[47360.071266] [<ffffffff816c895f>] ret_from_fork+0x3f/0x70
[47360.071584] [<ffffffff81090880>] ? kthread_freezable_should_stop+0x70/0x70
[47360.077579] RIP [<ffffffff8119da5c>] vmalloc_to_page+0x6c/0xb0
[47360.077979] RSP <ffff88042d667bb8>

And gdb points to the following source:

(gdb) list *vmalloc_to_page+0x6c
0xffffffff8119e50c is in vmalloc_to_page (mm/vmalloc.c:246).
241 */
242 VIRTUAL_BUG_ON(!is_vmalloc_or_module_addr(vmalloc_addr));
243
244 if (!pgd_none(*pgd)) {
245 pud_t *pud = pud_offset(pgd, addr);
246 if (!pud_none(*pud)) {
247 pmd_t *pmd = pmd_offset(pud, addr);
248 if (!pmd_none(*pmd)) {
249 pte_t *ptep, pte;
250
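
For illustration, the fault is consistent with this walk treating the
huge-mapping leaf entry as if it pointed to a lower-level page table.
Roughly (a simplified sketch of the walk, not the actual mm/vmalloc.c
code):

pgd_t *pgd = pgd_offset_k(addr);	/* vmalloc-area pgd entry is present */
pud_t *pud = pud_offset(pgd, addr);
/*
 * With a huge ioremap mapping, *pud (or *pmd) is a leaf entry whose
 * frame is the BAR's physical address rather than a page table, so the
 * next pXd_offset() computes __va(0x3c7e00000000) = 0xffffc47e00000000,
 * which is exactly the faulting address in the oops above.
 */
pmd_t *pmd = pmd_offset(pud, addr);	/* bogus "table" pointer */
if (!pmd_none(*pmd))			/* dereferencing it faults */
	/* the pte level is never reached */;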

Thanks,
Ashutosh

2015-07-20 19:22:26

by Toshi Kani

Subject: Re: Regression in v4.2-rc1: vmalloc_to_page with ioremap

On Mon, 2015-07-20 at 11:33 -0700, Ashutosh Dixit wrote:
> On Mon, Jul 20 2015 at 08:59:02 AM, Toshi Kani <[email protected]> wrote:
>
> > Can you also try the 'nohugeiomap' kernel option to the 4.2-rc1 kernel
> > to see if the problem goes away?
>
> Yes the problem goes away with 'nohugeiomap'.
>
> > Yes, vmalloc_to_page() assumes 4KB mappings. But ioremap with huge
> > pages was already enabled in v4.1, while v4.2-rc1 updated the MTRR
> > checks in the kernel mapping helpers.
>
> Yes, we had bisected the failure down to the following commit, and the
> problem goes away if this commit is reverted. We think that without
> this commit ioremap maps device memory with 4K pages, which is why we
> don't see the issue.
>
> commit b73522e0c1be58d3c69b124985b8ccf94e3677f7
> Author: Toshi Kani <[email protected]>
> Date: Tue May 26 10:28:10 2015 +0200
>
> x86/mm/mtrr: Enhance MTRR checks in kernel mapping helpers

Thanks for looking into this! That makes sense. Your system has the MTRR
default type set to UC, which covers the device memory ranges. In v4.1,
large page mappings are not allowed for ranges covered by MTRRs with
non-WB types. In v4.2, large page mappings are allowed as long as the
request type does not conflict with MTRRs. So, v4.2 enabled large page
mappings for the driver.
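
Roughly speaking, the check in the kernel mapping helpers
(pud_set_huge()/pmd_set_huge()) changed like this (a simplified
before/after sketch, not the exact code):

/* v4.1: refuse a huge mapping unless the whole range is MTRR type WB */
u8 mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE);
if (mtrr != MTRR_TYPE_WRBACK)
	return 0;		/* ioremap falls back to 4KB pages */

/* v4.2: the lookup also reports whether the range has a single, uniform
 * MTRR type; a uniform non-WB type (such as the UC default type here)
 * no longer blocks the huge mapping, only a range spanning mixed MTRR
 * types does. So the 8GB UC BARs are now mapped with huge pages.
 */
u8 uniform;
mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE, &uniform);
if (!uniform)
	return 0;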

> > Can you send me outputs of the following files? If the driver
> > fails to load in v4.2-rc1, you can obtain the info in v4.1.
> >
> > /proc/mtrr
> > /proc/iomem
> > /proc/vmallocinfo
> > /sys/kernel/debug/kernel_page_tables (need CONFIG_X86_PTDUMP set)
>
> Since the outputs are large I have sent you the outputs in a separate
> mail outside the mailing list.

Did you collect the info with your driver loaded? I did not see ioremap
requests from your driver in the vmallocinfo output.

> > Also, does the driver map a regular memory range with ioremap? If not,
> > how does 'struct page' get allocated for the range (since
> > vmalloc_to_page returns a page pointer)?
>
> No, the driver does not map regular memory with ioremap, only device
> memory. vmalloc_to_page was returning a valid 'struct page' in this
> case too. It appears it can do this correctly using pte_page as long
> as all four page table levels (pgd, pud, pmd, pte) are present, and
> the problem seems to be happening because in the case of huge pages
> they are not. For us the BAR size is 8 GB, so we think that with the
> new code ioremap maps the BARs using 1 GB pages.

Well, it is probably not a valid `struct page` pointer you got from
vmalloc_to_page... pte_page(pte) simply adds a pfn to vmemmap. So, yes,
you get a pointer, which can be put back to the pfn by subtracting vmemmap.
It may look fine, but this pointer does not point to a struct page unless
it is part of the regular memory (or you somehow allocated struct page for
the range). Can you check to see if the struct page pointer from
vmalloc_to_page actually points to a struct page entry?

/* memmap is virtually contiguous. */
#define __pfn_to_page(pfn) (vmemmap + (pfn))
#define __page_to_pfn(page) (unsigned long)((page) - vmemmap)
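
For example, something like this (just a sketch, run on v4.1 or with
'nohugeiomap' where vmalloc_to_page still works) would show it:

struct page *page = vmalloc_to_page(vaddr);
unsigned long pfn = page_to_pfn(page);

/* regular memory has a memmap entry behind the pointer; MMIO does not */
pr_info("pfn %#lx pfn_valid=%d\n", pfn, pfn_valid(pfn));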

Thanks,
-Toshi

2015-07-21 15:17:06

by Dixit, Ashutosh

Subject: Re: Regression in v4.2-rc1: vmalloc_to_page with ioremap

On Mon, Jul 20 2015 at 12:21:18 PM, Toshi Kani <[email protected]> wrote:
>> > Can you send me outputs of the following files? If the driver
>> > fails to load in v4.2-rc1, you can obtain the info in v4.1.
>> >
>> > /proc/mtrr
>> > /proc/iomem
>> > /proc/vmallocinfo
>> > /sys/kernel/debug/kernel_page_tables (need CONFIG_X86_PTDUMP set)
>>
>> Since the outputs are large I have sent you the outputs in a separate
>> mail outside the mailing list.
>
> Did you collect the info with your driver loaded? I did not see ioremap
> requests from your driver in the vmallocinfo output.

Sorry, yes, the driver was loaded. The ioremap itself is not done by
scif.ko but by mic_host.ko, and the addresses are then passed to
scif.ko. So I think what you are looking for is this:

0xffffc90040000000-0xffffc90240001000 8589938688 mic_probe+0x281/0x5c0 \
[mic_host] phys=3c7e00000000 ioremap
0xffffc90280000000-0xffffc90480001000 8589938688 mic_probe+0x281/0x5c0 \
[mic_host] phys=3c7c00000000 ioremap

>> > Also, does the driver map a regular memory range with ioremap? If not,
>> > how does 'struct page' get allocated for the range (since
>> > vmalloc_to_page returns a page pointer)?
>>
>> No, the driver does not map regular memory with ioremap, only device
>> memory. vmalloc_to_page was returning a valid 'struct page' in this
>> case too. It appears it can do this correctly using pte_page as long
>> as all four page table levels (pgd, pud, pmd, pte) are present, and
>> the problem seems to be happening because in the case of huge pages
>> they are not. For us the BAR size is 8 GB, so we think that with the
>> new code ioremap maps the BARs using 1 GB pages.
>
> Well, it is probably not a valid `struct page` pointer you got from
> vmalloc_to_page... pte_page(pte) simply adds a pfn to vmemmap. So, yes,
> you get a pointer, which can be put back to the pfn by subtracting vmemmap.
> It may look fine, but this pointer does not point to a struct page unless
> it is part of the regular memory (or you somehow allocated struct page for
> the range). Can you check to see if the struct page pointer from
> vmalloc_to_page actually points to a struct page entry?
>
> /* memmap is virtually contiguous. */
> #define __pfn_to_page(pfn) (vmemmap + (pfn))
> #define __page_to_pfn(page) (unsigned long)((page) - vmemmap)

Yes, you are correct, the 'struct page' pointer returned by
vmalloc_to_page does not point to a "real" struct page entry. Neither
have we allocated a struct page for the range. However, it used to work
because we pass the returned pointer to the streaming DMA mapping APIs
(dma_map_ops->dma_map_sg or dma_map_ops->dma_map_page), and all those
functions do is call page_to_phys, so they only care about the physical
address.
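
That is, the "fake" pointer still yields the right physical address
because page_to_phys() just reverses the vmemmap arithmetic (sketch):

page = vmalloc_to_page(va);		/* vmemmap + pfn */
phys = page_to_phys(page);		/* (page - vmemmap) << PAGE_SHIFT */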

Would it be possible to have a different API which can do this or can
vmalloc_to_page be updated to handle huge ioremaps without crashing? Or
would you have a suggestion for doing this differently?

Thanks,
Ashutosh

2015-07-21 20:40:21

by Toshi Kani

Subject: Re: Regression in v4.2-rc1: vmalloc_to_page with ioremap

On Tue, 2015-07-21 at 08:17 -0700, Ashutosh Dixit wrote:
> On Mon, Jul 20 2015 at 12:21:18 PM, Toshi Kani <[email protected]> wrote:
> > > > Can you send me outputs of the following files? If the driver
> > > > fails to load in v4.2-rc1, you can obtain the info in v4.1.
> > > >
> > > > /proc/mtrr
> > > > /proc/iomem
> > > > /proc/vmallocinfo
> > > > /sys/kernel/debug/kernel_page_tables (need CONFIG_X86_PTDUMP set)
> > >
> > > Since the outputs are large I have sent you the outputs in a separate
> > > mail outside the mailing list.
> >
> > Did you collect the info with your driver loaded? I did not see
> > ioremap requests from your driver in the vmallocinfo output.
>
> Sorry, yes, the driver was loaded. The ioremap itself is not done by
> scif.ko but by mic_host.ko, and the addresses are then passed to
> scif.ko. So I think what you are looking for is this:
>
> 0xffffc90040000000-0xffffc90240001000 8589938688 mic_probe+0x281/0x5c0 \
> [mic_host] phys=3c7e00000000 ioremap
> 0xffffc90280000000-0xffffc90480001000 8589938688 mic_probe+0x281/0x5c0 \
> [mic_host] phys=3c7c00000000 ioremap

OK, this confirms that they had 4KB mappings to 8GB ranges before.

0xffffc90040000000-0xffffc90240000000 8G RW PWT GLB NX pte
0xffffc90280000000-0xffffc90480000000 8G RW PWT GLB NX pte

And these ranges are from PCI MMIO.

3c7c00000000-3c7dffffffff : PCI Bus 0000:84
3c7c00000000-3c7dffffffff : 0000:84:00.0
3c7c00000000-3c7dffffffff : mic
3c7e00000000-3c7fffffffff : PCI Bus 0000:83
3c7e00000000-3c7fffffffff : 0000:83:00.0
3c7e00000000-3c7fffffffff : mic

> > > > Also, does the driver map a regular memory range with ioremap?
> > > > If not, how does 'struct page' get allocated for the range
> > > > (since vmalloc_to_page returns a page pointer)?
> > >
> > > No, the driver does not map regular memory with ioremap, only
> > > device memory. vmalloc_to_page was returning a valid 'struct page'
> > > in this case too. It appears it can do this correctly using
> > > pte_page as long as all four page table levels (pgd, pud, pmd,
> > > pte) are present, and the problem seems to be happening because in
> > > the case of huge pages they are not. For us the BAR size is 8 GB,
> > > so we think that with the new code ioremap maps the BARs using
> > > 1 GB pages.
> >
> > Well, it is probably not a valid `struct page` pointer you got from
> > vmalloc_to_page... pte_page(pte) simply adds a pfn to vmemmap. So,
> > yes, you get a pointer, which can be put back to the pfn by
> > subtracting vmemmap. It may look fine, but this pointer does not
> > point to a struct page unless it is part of the regular memory (or
> > you somehow allocated struct page for the range). Can you check to
> > see if the struct page pointer from vmalloc_to_page actually points
> > to a struct page entry?
> >
> > /* memmap is virtually contiguous. */
> > #define __pfn_to_page(pfn) (vmemmap + (pfn))
> > #define __page_to_pfn(page) (unsigned long)((page) - vmemmap)
>
> Yes, you are correct, the 'struct page' pointer returned by
> vmalloc_to_page does not point to a "real" struct page entry. Neither
> have we allocated a struct page for the range. However, it used to
> work because we pass the returned pointer to the streaming DMA mapping
> APIs (dma_map_ops->dma_map_sg or dma_map_ops->dma_map_page), and all
> those functions do is call page_to_phys, so they only care about the
> physical address.
>
> Would it be possible to have a different API which can do this or can
> vmalloc_to_page be updated to handle huge ioremaps without crashing? Or
> would you have a suggestion for doing this differently?

You can do the following instead. If you have the physical address already
(i.e. the address you passed to ioremap), you can skip slow_virt_to_phys().
pfn_to_page() is a hack for the time being so that you can use the same
DMA mapping APIs.

phys = slow_virt_to_phys(vaddr);
page = pfn_to_page(phys >> PAGE_SHIFT);
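
Or, since you already have the BAR physical address from the ioremap
call, it reduces to something like this (sketch; bar_pa and offset
stand for whatever your driver keeps around):

page = pfn_to_page((bar_pa + offset) >> PAGE_SHIFT);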

Dan is working on the change to introduce __pfn_t. With this change, you
can pass a pfn, instead of a fake page pointer, to APIs. You may want to
check if your APIs are covered in this change.
https://lkml.org/lkml/2015/6/5/802

Thanks,
-Toshi


2015-07-21 20:45:00

by Toshi Kani

Subject: Re: Regression in v4.2-rc1: vmalloc_to_page with ioremap

On Tue, 2015-07-21 at 14:39 -0600, Toshi Kani wrote:
> On Tue, 2015-07-21 at 08:17 -0700, Ashutosh Dixit wrote:
> > On Mon, Jul 20 2015 at 12:21:18 PM, Toshi Kani <[email protected]>

:
> > Yes, you are correct, the 'struct page' pointer returned by
> > vmalloc_to_page does not point to a "real" struct page entry.
> > Neither have we allocated a struct page for the range. However, it
> > used to work because we pass the returned pointer to the streaming
> > DMA mapping APIs (dma_map_ops->dma_map_sg or
> > dma_map_ops->dma_map_page), and all those functions do is call
> > page_to_phys, so they only care about the physical address.
> >
> > Would it be possible to have a different API which can do this or can
> > vmalloc_to_page be updated to handle huge ioremaps without crashing?
> > Or would you have a suggestion for doing this differently?
>
> You can do the following instead. If you have the physical address
> already (i.e. the address you passed to ioremap), you can skip
> slow_virt_to_phys(). pfn_to_page() is a hack for the time being so
> that you can use the same DMA mapping APIs.
>
> phys = slow_virt_to_phys(vaddr);
> page = pfn_to_page(phys >> PAGE_SHIFT);

Forgot to mention: slow_virt_to_phys() is only defined on x86, but I
think your driver is x86-only. Let me know if this is a problem.

Thanks,
-Toshi

2015-07-22 07:13:21

by Dixit, Ashutosh

Subject: Re: Regression in v4.2-rc1: vmalloc_to_page with ioremap

On Tue, Jul 21 2015 at 01:39:10 PM, Toshi Kani <[email protected]> wrote:
>
> You can do the following instead. If you have the physical address already
> (i.e. the address you passed to ioremap), you can skip slow_virt_to_phys().
> pfn_to_page() is a hack for the time being so that you can use the same
> DMA mapping APIs.
>
> phys = slow_virt_to_phys(vaddr);
> page = pfn_to_page(phys >> PAGE_SHIFT);
>
> Dan is working on the change to introduce __pfn_t. With this change, you
> can pass a pfn, instead of a fake page pointer, to APIs. You may want to
> check if your APIs are covered in this change.
> https://lkml.org/lkml/2015/6/5/802

Thanks, we can do this for now till Dan's changes come online. Also, we
already have the physical address. We will submit a patch with this
change.

Ashutosh

2015-07-22 14:55:44

by Toshi Kani

Subject: Re: Regression in v4.2-rc1: vmalloc_to_page with ioremap

On Wed, 2015-07-22 at 00:13 -0700, Ashutosh Dixit wrote:
> On Tue, Jul 21 2015 at 01:39:10 PM, Toshi Kani <[email protected]> wrote:
> >
> > You can do the following instead. If you have the physical address
> > already (i.e. the address you passed to ioremap), you can skip
> > slow_virt_to_phys(). pfn_to_page() is a hack for the time being so
> > that you can use the same DMA mapping APIs.
> >
> > phys = slow_virt_to_phys(vaddr);
> > page = pfn_to_page(phys >> PAGE_SHIFT);
> >
> > Dan is working on the change to introduce __pfn_t. With this change,
> > you can pass a pfn, instead of a fake page pointer, to APIs. You may
> > want to check if your APIs are covered in this change.
> > https://lkml.org/lkml/2015/6/5/802
>
> Thanks, we can do this for now till Dan's changes come online. Also, we
> already have the physical address. We will submit a patch with this
> change.

Sounds great. Thanks,
-Toshi