2004-04-12 22:29:39

by Terence Ripperda

[permalink] [raw]
Subject: PAT support

Hi all,

quite a while back, I sent out email about adding Page Attribute Table support to the kernel (http://www.ker
neltraffic.org/kernel-traffic/kt20030616_219.html#3).

At the time, the concern was being able to mark remapped i/o pages Write Combined in the case that we ran out of MTRRs to do so. This was mainly for the agp aperture and framebuffer. With PCI Express systems coming out, the need changes slightly. PCI Express does not have a centralized aperture like agp that can be marked WC. Instead, individual system pages of memory need to be marked WC via PAT in the page tables. This significantly increases the need for PAT support under linux to maintain high performance levels on PCI Express systems.

I thought the best approach would be to handle some of the original feedback in the code before I came backto ping lkml. I discussed things a little offline with Andi Kleen. He suggested I focus on the simpler i/o regions first, then come back to handling main memory once I had that done and had gotten feedback on it.

I've worked on a mechanism (cachemap) to track what type of caching a region of memory is currently mapped as. when a new region of memory is mapped, cachemap is queried to make sure the new region's caching type matches the old type (or is compatible with the old type). if the cachemap query succeeds, it's safe to map the new i/o region, otherwise it's not safe.

for the first pass, I focused on testing ioremap. so the cachemap queries are only made from ioremap. I also added code to have the mtrr code call the cachemap code (in this case, it's a report rather than a query to indicate the mapping's already made). I've made a few test runs on systems here, and it seems to work fairly well.

this current patch includes the original PAT support and the new cachemap mechanism. note that the cachemap mechanism does not actually change any caching attributes, it only keeps track of the attributes and tests regions. I think the end idea would be that drivers would use the normal ioremap/change_page_attr/remap_page_range mechanisms like they already do, and these mechanisms would in turn use cachemap to make sure there's no conflicts. I'm completely open to how any specific details should work, and any changes needed to be made.

Thanks,
Terence


Attachments:
(No filename) (2.25 kB)
cachemap-1.9-2.6.4.patch.bz2 (9.37 kB)
Download all attachments

2004-04-13 00:01:41

by Andi Kleen

[permalink] [raw]
Subject: Re: PAT support

Terence Ripperda <[email protected]> writes:

Hi Terence,

> this current patch includes the original PAT support and the new
> cachemap mechanism. note that the cachemap mechanism does not
> actually change any caching attributes, it only keeps track of the
> attributes and tests regions. I think the end idea would be that
> drivers would use the normal
> ioremap/change_page_attr/remap_page_range mechanisms like they
> already do, and these mechanisms would in turn use cachemap to make
> sure there's no conflicts. I'm completely open to how any specific
> details should work, and any changes needed to be made.

Looks good for starting. The patch could use some minor cleanup still,
but it should be good enough for testing.

As for an interface - i still think it would be cleaner to just
call it from change_page_attr(). Then other users only need to
call a single function. But that's easily changed.

To make it really useful I think we need ioremap_wrcomb() and support
in the bus/pci mmap function (the PCI layer already has ioctls for this,
they are just ignored on i386 right). Then the X server could
start using it.

Without any users the testing coverage would be probably not too good,
but it needs some testing in the real world before it could
be merged first. Maybe it could be simply hooked into the AGP
driver and into some DRM driver. Then people would start testing
it at least.

-Andi

2004-04-13 05:34:37

by Manfred Spraul

[permalink] [raw]
Subject: Re: PAT support

Hi Terence,

in your patch, you write
+/* Here is the PAT's default layout on ia32 cpus when we are done.
+ * PAT0: Write Back
+ * PAT1: Write Combine
+ * PAT2: Uncached
+ * PAT3: Uncacheable
+ * PAT4: Write Through
+ * PAT5: Write Protect
+ * PAT6: Uncached
+ * PAT7: Uncacheable

Is that layout possible?
There is an errata in the B2 and C1 stepping of the Pentium 4 cpus that
results in incorrect PAT numbers: the highest bit is ignored by the CPU
under some circumstances. There's a similar errata (E27) that affects
all Pentium 3 cpus: The highest bit is always ignored.
I think we need a fallback to 4 PAT entries.

--
Manfred

2004-04-13 08:31:30

by Andy Whitcroft

[permalink] [raw]
Subject: Re: PAT support

--On 12 April 2004 17:29 -0500 Terence Ripperda <[email protected]>
wrote:

> quite a while back, I sent out email about adding Page Attribute Table
> support to the kernel (http://www.ker
> neltraffic.org/kernel-traffic/kt20030616_219.html#3).

> this current patch includes the original PAT support and the new cachemap
> mechanism. note that the cachemap mechanism does not actually change any
> caching attributes, it only keeps track of the attributes and tests
> regions. I think the end idea would be that drivers would use the normal
> ioremap/change_page_attr/remap_page_range mechanisms like they already
> do, and these mechanisms would in turn use cachemap to make sure there's
> no conflicts. I'm completely open to how any specific details should
> work, and any changes needed to be made.

Yeah, I've experemented recently with PAT too. Not had a chance to look at
the code in any specifics. But I did notice there appear to be no notes or
warnings on the issues of using PAT based mappings. Cirtainly there are
some _very_ onerus restrictions which I have personally tested and found to
be true :(. Perhaps we could get some big fat warning style comments.

Cirtainly the documentation is _very_ specific about this limitation, I
think I saw this being implmented in the code.

+ * According to the INTEL documentation it is the systems responsibility
+ * to ensure that the PAT registers are kept in agreement on all processors
+ * in a system. Changing these registers must occu in a manner which
+ * maintains the consistency of the processor caches and translation
+ * lookaside buggers (TLB).

Also, I have confirmed that if you have any Intel processors which do not
have the SS (Self Snoop) capability, you cannot have two independent
mappings to a page with different cache attributes. I have been hit by
this and you get stale data returned from the cache.

-apw

2004-04-13 16:22:56

by Terence Ripperda

[permalink] [raw]
Subject: Re: PAT support



On Mon, Apr 12, 2004 at 05:01:33PM -0700, [email protected] wrote:
> Looks good for starting. The patch could use some minor cleanup still,
> but it should be good enough for testing.

sure. I don't know all of the kernel style guidelines well enough, so I wouldn't be surprised if some of that was off. Also, I'm not sure if I did the architectural breakdown correctly.


> As for an interface - i still think it would be cleaner to just
> call it from change_page_attr(). Then other users only need to
> call a single function. But that's easily changed.

sure, I'm fine with that. I have a couple of related questions that might be dumb:

we discussed before how change_page_attr takes a struct page *. there are many cases where pci i/o memory is backed by struct pages, but in the majority of x86 desktops, this isn't the case. I'm unclear how these cases would be handled? would change_page_attr be changed to take address/size rather than struct page? would drivers be responsible for recognizing this case and calling a different api in that case (cmap directly)? should x86 be changed such that all memory is covered by struct pages (doubtful)?

a lot of mmaps currently use remap_page_range w/o change_page_attr currently. I had thought that by putting the cmap_request inside of remap_page_range, we would make sure we caught all remappings and didn't miss any potential conflicts. is the preference to have all drivers updated and be responsible for calling change_page_attr before remap_page_range? or perhaps I'm missing the obvious: ioremap calls change_page_attr in the correct cases, so perhaps remap_page_range should call change_page_attr, which would in turn call cmap_request.

the main reason I hadn't added the cmap_request call to change_page_attr was due to us focusing on i/o regions first, then tackling system memory later. I can go ahead and add the call, since cmap_request will currently recognize system memory and skip over it, returning success. I would then still need to figure out how to work change_page_attr and i/o regions, at least on x86.


> To make it really useful I think we need ioremap_wrcomb() and support
> in the bus/pci mmap function (the PCI layer already has ioctls for this,
> they are just ignored on i386 right). Then the X server could
> start using it.

is this pci_mmap_page_range? I hadn't seen that before (it looks like it was added to i386 in the 2.6 series). it does look like it's plugged into proc_bus_pci_mmap for i386 on 2.6, but perhaps I'm missing something when skimming the code. that should be easy to add (or would be picked up by a change to remap_page_range).

I've added ioremap_wrcomb in my tree, but need to test it still.


> Without any users the testing coverage would be probably not too good,
> but it needs some testing in the real world before it could
> be merged first. Maybe it could be simply hooked into the AGP
> driver and into some DRM driver. Then people would start testing
> it at least.

sure. I did a quick scan of the code, I see an mmap function in the agpgart code. it looks like some of the drm drivers (ffb/i810/i830) have mmap, but not all of them. I would assume they relied on the agpgart mmap. the agpgart mmap could be updated, based on the decision above (driver vs. change_page_attr/remap_page_range), would that be enough?

Thanks,
Terence

2004-04-13 16:41:27

by Terence Ripperda

[permalink] [raw]
Subject: Re: PAT support


oh, good catch. we added comments detailing this to an internal evaluation stand-alone pat module for pci express testing (the same pat.c/pat.h files as a standalone module), and I forgot to propogate those comments to this patch.

here's a snippet of our comments:

/* Here is the PAT's default layout on ia32 cpus at boot/reset.
* This is table 10-12 in the IA-32 Intel Architecture Software Developer's
* Manual Volume 3: System Programming Guide.
* PAT0: Write Back
* PAT1: Write Through
* PAT2: Uncached (/ Uncacheable on Athlon cpus)
* PAT3: Uncacheable
* PAT4: Write Back
* PAT5: Write Through
* PAT6: Uncached (/ Uncacheable on Athlon cpus)
* PAT7: Uncacheable
*
* Rather than mucking with the PAT entries too much, we'll only replace one
* entry with Write Combining. Ideally, we'd do that to one of the latter 4
* entries, but due to errata in the P3/P4 cpus, we can only use the first
* four entries reliably.
*
* For that reason, we'll replace PAT1: Write Through with Write Combine.
*
* For more details about these errata, see Intel Pentium III Processor
* Specification Update, errata E.27 (Upper Four PAT Entries Not Usable With
* Mode B or Mode C Paging) and Intel Pentium IV Processor Specification
* Update, errata N46 (PAT Index MSB May Be Calculated Incorrectly)
*/

rather than reprogramming all of the pats (compare the above table to the below table), we only replaced one entry with write-combining. I could just update cachemap with the corrected comments, or go ahead and only update the one pat entry, whichever everyone prefers.

Thanks,
Terence

On Mon, Apr 12, 2004 at 10:34:22PM -0700, [email protected] wrote:
> Hi Terence,
>
> in your patch, you write
> +/* Here is the PAT's default layout on ia32 cpus when we are done.
> + * PAT0: Write Back
> + * PAT1: Write Combine
> + * PAT2: Uncached
> + * PAT3: Uncacheable
> + * PAT4: Write Through
> + * PAT5: Write Protect
> + * PAT6: Uncached
> + * PAT7: Uncacheable
>
> Is that layout possible?
> There is an errata in the B2 and C1 stepping of the Pentium 4 cpus that
> results in incorrect PAT numbers: the highest bit is ignored by the CPU
> under some circumstances. There's a similar errata (E27) that affects
> all Pentium 3 cpus: The highest bit is always ignored.
> I think we need a fallback to 4 PAT entries.
>
> --
> Manfred
>

2004-04-13 16:52:11

by Terence Ripperda

[permalink] [raw]
Subject: Re: PAT support

On Tue, Apr 13, 2004 at 01:36:13AM -0700, [email protected] wrote:
> But I did notice there appear to be no notes or
> warnings on the issues of using PAT based mappings. Cirtainly there are
> some _very_ onerus restrictions which I have personally tested and found to
> be true :(. Perhaps we could get some big fat warning style comments.

where would you like to see such warnings? arguably, all of the dangerous conditions should be handled by this core code to avoid problems (such as only using the first 4 pat entries, flushing the correct caches when updating the pat entries or pte bits). these problems really aren't all that different than any other cache aliasing/pte flushing issues that always exist, right?

> + * According to the INTEL documentation it is the systems responsibility
> + * to ensure that the PAT registers are kept in agreement on all processors
> + * in a system. Changing these registers must occu in a manner which
> + * maintains the consistency of the processor caches and translation
> + * lookaside buggers (TLB).

absolutely. I tried to handle this by initializing the pat entries as each cpu comes online at boot time, with cache flushes. I think Andi mentioned flushing the TLBs as well, I'll check up on that to make sure I'm doing that as well.

> Also, I have confirmed that if you have any Intel processors which do not
> have the SS (Self Snoop) capability, you cannot have two independent
> mappings to a page with different cache attributes. I have been hit by
> this and you get stale data returned from the cache.

certainly, that is more or less what this mechanism is intended to prevent. cmap_compare_cachings is an arch-dependent function, which allows architectures to allow/disallow different cache attributes. I certainly consider the current implementation to be a sample that needs some tweaking. for example, I forgot that I had allowed CACHED/NOCACHED overlaps, due to MTRRs that legally overlap like this. but that's probably not a situation we want for any other case, so I need to fix that.

Thanks,
Terence

2004-04-13 20:51:34

by Pavel Machek

[permalink] [raw]
Subject: Re: PAT support

Hi!

> in your patch, you write
> +/* Here is the PAT's default layout on ia32 cpus when we are done.
> + * PAT0: Write Back
> + * PAT1: Write Combine
> + * PAT2: Uncached
> + * PAT3: Uncacheable
> + * PAT4: Write Through
> + * PAT5: Write Protect
> + * PAT6: Uncached
> + * PAT7: Uncacheable
>
> Is that layout possible?
> There is an errata in the B2 and C1 stepping of the Pentium 4 cpus
> that results in incorrect PAT numbers: the highest bit is ignored by
> the CPU under some circumstances. There's a similar errata (E27) that
> affects all Pentium 3 cpus: The highest bit is always ignored.
> I think we need a fallback to 4 PAT entries.

What about arranging it so that foo is always more restrictive than
foo|4? That way you can get some slowdowns on bad cpus, but it
will always be correct.
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2004-04-14 00:58:39

by Andi Kleen

[permalink] [raw]
Subject: Re: PAT support

[I put Egbert Eich in cc who may be interested in the X server
side of this]

On Tue, Apr 13, 2004 at 11:21:08AM -0500, Terence Ripperda wrote:
>
>
> On Mon, Apr 12, 2004 at 05:01:33PM -0700, [email protected] wrote:
> > Looks good for starting. The patch could use some minor cleanup still,
> > but it should be good enough for testing.
>
> sure. I don't know all of the kernel style guidelines well enough, so I wouldn't be surprised if some of that was off. Also, I'm not sure if I did the architectural breakdown correctly.
>
>
> > As for an interface - i still think it would be cleaner to just
> > call it from change_page_attr(). Then other users only need to
> > call a single function. But that's easily changed.
>
> sure, I'm fine with that. I have a couple of related questions that might be dumb:
>
> we discussed before how change_page_attr takes a struct page *. there are many cases where pci i/o memory is backed by struct pages, but in the majority of x86 desktops, this isn't the case. I'm unclear how these cases would be handled? would change_page_attr be changed to take address/size rather than struct page? would drivers be responsible for recognizing this case and calling a different api in that case (cmap directly)? should x86 be changed such that all memory is covered by struct pages (doubtful)?

Covering everything with struct page * would be a waste of memory
(the mappings tend to be at the end of 4GB and just covering 4GB
with struct page costs considerable memory)

I would just change change_page_attr() back to use physical addresses.
In fact the original version did that, but someone complained that
it may get used for highmem too - but that was never handled
(it BUGs right now) and doesn't make sense because highmem is not
in the direct mapping.


> a lot of mmaps currently use remap_page_range w/o change_page_attr currently. I had thought that by putting the cmap_request inside of remap_page_range, we would make sure we caught all remappings and didn't miss any potential conflicts. is the preference to have all drivers updated and be responsible for calling change_page_attr before remap_page_range? or perhaps I'm missing the obvious: ioremap calls change_page_attr in the correct cases, so perhaps remap_page_range should call change_page_attr, which would in turn call cmap_request.

Normally it's not needed on i386 because the mappings remapped by
remap_page_range are not in the direct mapping (which only maps the first
~900MB) on i386. On x86-64 it is buggy when you have enough memory,
because the PCI IO regions below 4GB will be in the direct mapping then.
Then it should call change_page_attr when it uses a non caching
attribute, agreed.

It could be a problem on x86 too if someone remaps a mapping in the
640k-1MB hole, but that's probably unlikely.


>
> the main reason I hadn't added the cmap_request call to change_page_attr was due to us focusing on i/o regions first, then tackling system memory later. I can go ahead and add the call, since cmap_request will currently recognize system memory and skip over it, returning success. I would then still need to figure out how to work change_page_attr and i/o regions, at least on x86.

For x86 it's not needed, only for x86-64, but it would be nice to use
common behaviour between the two.

>
>
> > To make it really useful I think we need ioremap_wrcomb() and support
> > in the bus/pci mmap function (the PCI layer already has ioctls for this,
> > they are just ignored on i386 right). Then the X server could
> > start using it.
>
> is this pci_mmap_page_range? I hadn't seen that before (it looks like it was added to i386 in the 2.6 series). it does look like it's plugged into proc_bus_pci_mmap for i386 on 2.6, but perhaps I'm missing something when skimming the code. that should be easy to add (or would be picked up by a change to remap_page_range).
>
> I've added ioremap_wrcomb in my tree, but need to test it still.

Yes. I thought it was already in 2.4?

The pci mmap also has an ioctl - see drivers/pci/proc.c:proc_bus_pci_ioctl -
to set write combining. It would be nice if that was just hooked up.



> > Without any users the testing coverage would be probably not too good,
> > but it needs some testing in the real world before it could
> > be merged first. Maybe it could be simply hooked into the AGP
> > driver and into some DRM driver. Then people would start testing
> > it at least.
>
> sure. I did a quick scan of the code, I see an mmap function in the agpgart code. it looks like some of the drm drivers (ffb/i810/i830) have mmap, but not all of them. I would assume they relied on the agpgart mmap. the agpgart mmap could be updated, based on the decision above (driver vs. change_page_attr/remap_page_range), would that be enough?

Yes.

And possible the X server to use /proc/bus/pci/* instead of /dev/mem
and then use the wrcombining ioctl instead of /proc/mtrr.

-Andi

2004-04-15 04:05:55

by Eric W. Biederman

[permalink] [raw]
Subject: Re: PAT support

Manfred Spraul <[email protected]> writes:

> Hi Terence,
>
> in your patch, you write
> +/* Here is the PAT's default layout on ia32 cpus when we are done.
> + * PAT0: Write Back
> + * PAT1: Write Combine
> + * PAT2: Uncached
> + * PAT3: Uncacheable
> + * PAT4: Write Through
> + * PAT5: Write Protect
> + * PAT6: Uncached
> + * PAT7: Uncacheable
>
> Is that layout possible?
> There is an errata in the B2 and C1 stepping of the Pentium 4 cpus that results
> in incorrect PAT numbers: the highest bit is ignored by the CPU under some
> circumstances. There's a similar errata (E27) that affects all Pentium 3 cpus:
> The highest bit is always ignored.
> I think we need a fallback to 4 PAT entries.

Write Through and Write Protect are essentially useless. So that should
be easy to do.

Eric

2004-04-15 04:11:57

by Eric W. Biederman

[permalink] [raw]
Subject: Re: PAT support

Andi Kleen <[email protected]> writes:

> Terence Ripperda <[email protected]> writes:
>
> Hi Terence,
>
> > this current patch includes the original PAT support and the new
> > cachemap mechanism. note that the cachemap mechanism does not
> > actually change any caching attributes, it only keeps track of the
> > attributes and tests regions. I think the end idea would be that
> > drivers would use the normal
> > ioremap/change_page_attr/remap_page_range mechanisms like they
> > already do, and these mechanisms would in turn use cachemap to make
> > sure there's no conflicts. I'm completely open to how any specific
> > details should work, and any changes needed to be made.
>
> Looks good for starting. The patch could use some minor cleanup still,
> but it should be good enough for testing.
>
> As for an interface - i still think it would be cleaner to just
> call it from change_page_attr(). Then other users only need to
> call a single function. But that's easily changed.
>
> To make it really useful I think we need ioremap_wrcomb() and support
> in the bus/pci mmap function (the PCI layer already has ioctls for this,
> they are just ignored on i386 right). Then the X server could
> start using it.
>
> Without any users the testing coverage would be probably not too good,
> but it needs some testing in the real world before it could
> be merged first. Maybe it could be simply hooked into the AGP
> driver and into some DRM driver. Then people would start testing
> it at least.

This would also be extremely useful on machines with large amounts
of memory, for write-back mappings. With large amounts but odd sized
entries it becomes extremely tricky to map all of the memory using
mtrrs.

Last I looked the common remedy was to using overlapping mtrrs
which the kernel does not understand and that make it impossible
for X to setup write-combining MTRRs.

The memory map on x86 shows no hope of getting simpler and mtrrs
are getting continually less good at being able to scale so getting
PAT support of some kind in the kernel would be extremely good.

Eric

2004-04-15 16:38:21

by Andi Kleen

[permalink] [raw]
Subject: Re: PAT support

> This would also be extremely useful on machines with large amounts
> of memory, for write-back mappings. With large amounts but odd sized
> entries it becomes extremely tricky to map all of the memory using
> mtrrs.

Yes agreed. I already had vendors complaining about this.
But for this it will need some more work - the MTRRs need to be fully
converted to PAT and then disabled (because MTRRs have
higher priority than PAT). Doing so is a lot more risky than
what Terrence's patch does currently though. But longer term
we will need it.

Also it will still need to handle overlapping ranges. I suppose
it will need some simple rules like: converting from UC to WC is
always ok.

-Andi

2004-04-15 18:45:53

by Eric W. Biederman

[permalink] [raw]
Subject: Re: PAT support

Andi Kleen <[email protected]> writes:

> > This would also be extremely useful on machines with large amounts
> > of memory, for write-back mappings. With large amounts but odd sized
> > entries it becomes extremely tricky to map all of the memory using
> > mtrrs.
>
> Yes agreed. I already had vendors complaining about this.
> But for this it will need some more work - the MTRRs need to be fully
> converted to PAT and then disabled (because MTRRs have
> higher priority than PAT). Doing so is a lot more risky than
> what Terrence's patch does currently though. But longer term
> we will need it.

Ugh. You are right. The processors look at the two types and pick
the one that caches the least. So PAT can't enable caching :(

> Also it will still need to handle overlapping ranges. I suppose
> it will need some simple rules like: converting from UC to WC is
> always ok.

Right.

That plus it should have some additional rules like the
e820 map trumps the mtrrs in specifying what is memory so
should be cacheable.

Eric


2004-04-16 00:00:50

by Albert Cahalan

[permalink] [raw]
Subject: Re: PAT support

Eric W. Biederman writes:
> Andi Kleen <[email protected]> writes:

>> Yes agreed. I already had vendors complaining about this.
>> But for this it will need some more work - the MTRRs need to be fully
>> converted to PAT and then disabled (because MTRRs have
>> higher priority than PAT). Doing so is a lot more risky than
>> what Terrence's patch does currently though. But longer term
>> we will need it.
>
> Ugh. You are right. The processors look at the two types and pick
> the one that caches the least. So PAT can't enable caching :(

There's more to it than this. You need to use both
the MTRRs and PAT for best performance. I can't find
the explanation in my AMD manual, so maybe this is
an Intel-only thing. From (human) memory:

Use the PAT stuff as your primary cache-control
mechanism. Then, to the extent that you can, use
the MTRRs to double-mark some of the uncached or
uncachable memory. This avoids some sort of
useless bus traffic or TLB goings-on.

Sorry I can't be clearer; check the Intel books.



2004-04-16 18:11:28

by Terence Ripperda

[permalink] [raw]
Subject: Re: PAT support


Thanks for the great feedback (as always), Andi.

I've gotten started on this, mainly focusing on modifying change_page_attr to use unsigned long instead of struct page*. hit a few snags on the way, but got things figured out. it should be working now, but may still have a bug or two I need to work out. I also moved cmap_request_range into change_page_attr as suggested.

what I wanted to bring up is some of the issues I'm running into that will require some slight adjustments. I wanted to make sure to get feedback on this.

the first is that change_page_attr is basically a one-way operation that just sets the caching for a range of pages. in contrast, the cmap_ api is more of a push/pop approach, with cmap_request_range and cmap_release_range. all callers of change_page_attr generally use it to set a new attribute, then assume they need to restore the attribute back to PAGE_KERNEL (WB). additionally, when splitting/merging large pages, change_page_attr has similar assumptions (that we flip between a default PAGE_KERNEL caching and some other caching).

I haven't thought through all the details yet, but would it be acceptable to consider modifying change_page_attr to also use a push/pop approach? it may also make sense to remove the assumptions about initial state and actually save off what the initial state is, so it can be restored to that state.

the second is in iounmap. this function only takes an address, we don't get the size until we've retrieved the struct vm_struct* by freeing the mapping via remove_vm_area. once we have this size, we can call change_page_attr. but by then, the virtual mapping was already freed in remove_vm_area, so playing with the page tables in change_page_attr causes problems:

remap_area_pte: page already exists
------------[ cut here ]------------
kernel BUG at arch/i386/mm/ioremap.c:37!
invalid operand: 0000 [#1]
...
Call Trace:
[<c011d9c4>] __ioremap+0xbc/0xec
[<c0228012>] acpi_os_map_memory+0x2a/0x44
[<c023c11a>] acpi_tb_get_table_header+0x52/0xdc
[<c023c076>] acpi_tb_get_table+0x16/0x68
[<c023fb87>] acpi_ut_allocate_owner_id+0x8f/0x9c
...

in this case, I think change_page_attr filled the pte back in after it had been freed by remove_vm_area.

this could be fixed multiple ways: change_page_attr could not touch the page tables at all for i/o regions (would ioremap cover everything? are 4M ptes used on i/o regions?), we could check for the PAGE_PRESENT bit (seems like a hack workaround), I could add an api to retrieve the struct vm_struct* so we could call change_page_attr before calling remove_vm_area. I wanted to check and see what the preferred approach was.

(aside: we only call change_page_attr if we're calling a non-standard ioremap, like ioremap_nocache. I need to move this logic into ioremap, so we get a cmap_ for all ioremaps, regardless of caching type. note that iounmap currently has no idea what caching ioremap requested.)

also, I'm attaching my current progress. note that this is a debug-work-in-progress, so has lots of extra printouts and perhaps some code that would be cleaned up (for example, I duplicated lookup_address -> lookup_phys_address to make some things easier, but that might be cleaned up before long). I also haven't had time to get to all the suggestions, but will get to them.

Thanks,
Terence


Attachments:
(No filename) (3.24 kB)
cachemap-1.10-pre1-2.6.4.bz2 (13.01 kB)
Download all attachments

2004-04-17 00:42:20

by Andi Kleen

[permalink] [raw]
Subject: Re: PAT support

On Fri, Apr 16, 2004 at 01:07:38PM -0500, Terence Ripperda wrote:
> the first is that change_page_attr is basically a one-way operation that just sets the caching for a range of pages. in contrast, the cmap_ api is more of a push/pop approach, with cmap_request_range and cmap_release_range. all callers of change_page_attr generally use it to set a new attribute, then assume they need to restore the attribute back to PAGE_KERNEL (WB). additionally, when splitting/merging large pages, change_page_attr has similar assumptions (that we flip between a default PAGE_KERNEL caching and some other caching).
>
> I haven't thought through all the details yet, but would it be acceptable to consider modifying change_page_attr to also use a push/pop approach? it may also make sense to remove the assumptions about initial state and actually save off what the initial state is, so it can be restored to that state.

change_page_attr can change more than just caching attributes,
also read/write (e.g. slab debug uses it for that)

At least for the later adding another book keeping data structure
may be too expensive. Of course the data structure is only needed
for change that set a non standard caching attribute,but having
an release function would require a handle to release, forcing
it on the others too.

I think I prefer the do/undo model instead of push/pop.
That can work with cmaps too. PAGE_KERNEL means no cmap,
PAGE_KERNEL_WC and PAGE_KERNEL_NOCACHE get a cmap.

> the second is in iounmap. this function only takes an address, we don't get the size until we've retrieved the struct vm_struct* by freeing the mapping via remove_vm_area. once we have this size, we can call change_page_attr. but by then, the virtual mapping was already freed in remove_vm_area, so playing with the page tables in change_page_attr causes problems:

remove_vm_area() needs to just be split into some worker functions
(__remove_vm_area et.al.)
>
> this could be fixed multiple ways: change_page_attr could not touch the page tables at all for i/o regions (would ioremap cover everything? are 4M ptes used on i/o regions?), we could check for the PAGE_PRESENT bit (seems like a hack workaround), I could add an api to retrieve the struct vm_struct* so we could call change_page_attr before calling remove_vm_area. I wanted to check and see what the preferred approach was.

The latest sounds cleanest to me.

> (aside: we only call change_page_attr if we're calling a non-standard ioremap, like ioremap_nocache. I need to move this logic into ioremap, so we get a cmap_ for all ioremaps, regardless of caching type. note that iounmap currently has no idea what caching ioremap requested.)

Why? no cmap means write back, no?

It's probably an academic case anyways - ioremaps without _nocache
should be near always from main memory, which it would ignore.

-Andi

2004-04-19 22:56:19

by Terence Ripperda

[permalink] [raw]
Subject: Re: PAT support

On Fri, Apr 16, 2004 at 05:42:17PM -0700, [email protected] wrote:
> change_page_attr can change more than just caching attributes,
> also read/write (e.g. slab debug uses it for that)

ah, ok. sorry, missed that.

> At least for the later adding another book keeping data structure
> may be too expensive.

makes sense.

> I think I prefer the do/undo model instead of push/pop.
> That can work with cmaps too. PAGE_KERNEL means no cmap,
> PAGE_KERNEL_WC and PAGE_KERNEL_NOCACHE get a cmap.

but then what is the point of cmap? I would expect a mix of WC and UC mappings to be much less dangerous than a mix of WC/UC and WB. perhaps my mindset is wrong, but it seems allowing ioremap to request a cached mapping is important, and that if that mapping was followed by ioremap_nocached or ioremap_wrcomb, that these subsequent calls should fail.

I did finish implementing your suggestion, that change_page_attr should consider PAGE_KERNEL as a call to cmap_release_range and anything else as a call to cmap_request_range. seemed to work ok, but I'm seeing the acpi table code doing a lot of ioremaps (cached) that are ignored, then iounmaps are causing cmap_release_range calls to complain about not finding the regions. of course in a final version, we'd cut out the debug output, but expecting lots of empty calls to cmap_release_range seems messy.

what if there was a restore_page_attr(unsigned long address, unsigned long numpages) that assumed the pgprot was PAGE_KERNEL. change_page_attr knows to call cmap_request_range and restore_page_attr knows to call cmap_release_range. otherwise they do the same thing, restore_ just inherently uses PAGE_KERNEL for the caching type.

> remove_vm_area() needs to just be split into some worker functions
> (__remove_vm_area et.al.)

ok, easily done.

Thanks,
Terence

2004-04-20 18:51:20

by Andi Kleen

[permalink] [raw]
Subject: Re: PAT support

On Mon, Apr 19, 2004 at 05:54:57PM -0500, Terence Ripperda wrote:
> > I think I prefer the do/undo model instead of push/pop.
> > That can work with cmaps too. PAGE_KERNEL means no cmap,
> > PAGE_KERNEL_WC and PAGE_KERNEL_NOCACHE get a cmap.
>
> but then what is the point of cmap? I would expect a mix of WC and UC mappings to be much less dangerous than a mix of WC/UC and WB. perhaps my mindset is wrong, but it seems allowing ioremap to request a cached mapping is important, and that if that mapping was followed by ioremap_nocached or ioremap_wrcomb, that these subsequent calls should fail.

Hmm, you're right. push/pop is probably better for io-mappings, otherwise
we cannot catch existing mappings. This will be needed for user mmap
too.

Ignore my previous suggestion on that then please. Sorry for the noise.


-Andi

2004-04-21 23:20:05

by Terence Ripperda

[permalink] [raw]
Subject: Re: PAT support


ok, got this finished up. modified change_page_attr to take a kernel virtual address instead of a page struct. also introduced restore_page_attr, which releases the cmap_range and assumed a pgprot of type PAGE_KERNEL. tweaked the change_page_attr a little to allow virtual addresses, convert from virtual to physical. if you'd prefer any style changes to what I made, feel free to flame me.

I also merged the ioremap functions into a single function (which really just means moving change_page_attr into __ioremap) and made ioremap, ioremap_nocache, and ioremap_wrcomb as inline routines that send the appropriate flags into __ioremap.

fixed an oversight in cmap_convert_flags.

updated all references to change_page_attr to use the correct arguments, but still need to finish the core function changes for x86_64.


between finishing this up and making a first pass at the /proc/bus/pci/* changes, I made some "doh!" realizations:

change_page_attr takes a kernel virtual address currently. this doesn't work well for mmap situations, which have a user virtual address. lookup_address could theoretically be changed to handle this (by switching between kernel & user pgd), but this wouldn't work for 4G/4G kernels.

change_page_attr can't take a physical address, since then it couldn't change_page_attr (change the page table attributes).

a side effect of having cmap_request_range being part of change_page_attr is that the caller must first setup the page tables for the virtual mapping, then check if the caching is ok. it would seem preferrable to use cmap_request_range to check if the caching is ok, setup the page tables, then modify the page tables via change_page_attr (for kernel pages).

I haven't looked too closely yet, but it would seem like all of this could be encapsulated in remap_page_range and ioremap (ie, driver writers would never call cmap_ routines directly). I'll spend a day or two looking into how well this works.


Thanks,
Terence

On Tue, Apr 20, 2004 at 11:51:12AM -0700, [email protected] wrote:
> On Mon, Apr 19, 2004 at 05:54:57PM -0500, Terence Ripperda wrote:
> > > I think I prefer the do/undo model instead of push/pop.
> > > That can work with cmaps too. PAGE_KERNEL means no cmap,
> > > PAGE_KERNEL_WC and PAGE_KERNEL_NOCACHE get a cmap.
> >
> > but then what is the point of cmap? I would expect a mix of WC and UC
> mappings to be much less dangerous than a mix of WC/UC and WB. perhaps
> my mindset is wrong, but it seems allowing ioremap to request a cached
> mapping is important, and that if that mapping was followed by
> ioremap_nocached or ioremap_wrcomb, that these subsequent calls should
> fail.
>
> Hmm, you're right. push/pop is probably better for io-mappings,
> otherwise
> we cannot catch existing mappings. This will be needed for user mmap
> too.
>
> Ignore my previous suggestion on that then please. Sorry for the noise.
>
>
> -Andi


Attachments:
(No filename) (2.83 kB)
cachemap-1.10-2.6.4.patch.bz2 (13.25 kB)
Download all attachments

2004-04-22 04:22:09

by Andi Kleen

[permalink] [raw]
Subject: Re: PAT support

On Wed, Apr 21, 2004 at 06:19:42PM -0500, Terence Ripperda wrote:
>
> ok, got this finished up. modified change_page_attr to take a kernel virtual address instead of a page struct. also introduced restore_page_attr, which releases the cmap_range and assumed a pgprot of type PAGE_KERNEL. tweaked the change_page_attr a little to allow virtual addresses, convert from virtual to physical. if you'd prefer any style changes to what I made, feel free to flame me.

I leave the style wars to others who are more motivated at nitpicking
than me ;-)

actually it doesn't look that bad.

>
> I also merged the ioremap functions into a single function (which really just means moving change_page_attr into __ioremap) and made ioremap, ioremap_nocache, and ioremap_wrcomb as inline routines that send the appropriate flags into __ioremap.
>
> fixed an oversight in cmap_convert_flags.
>
> updated all references to change_page_attr to use the correct arguments, but still need to finish the core function changes for x86_64.
>
>
> between finishing this up and making a first pass at the /proc/bus/pci/* changes, I made some "doh!" realizations:
>
> change_page_attr takes a kernel virtual address currently. this doesn't work well for mmap situations, which have a user virtual address. lookup_address could theoretically be changed to handle this (by switching between kernel & user pgd), but this wouldn't work for 4G/4G kernels.

Yes, change_page_attr is only for the kernel direct mapping, nothing more.

All other mapers (pci, /dev/mem, agp, ioremap_*) must take a cmap
reference and prevent conflicting maps in the new model.

> change_page_attr can't take a physical address, since then it couldn't change_page_attr (change the page table attributes).

It can: there is a 1:1 mapping from physical to kernel virtual
address with __va (except highmem, but that's not interesting here because it's
not mapped in the kernel and doesn't need to be changed)

actually when user space PAT is supported highmem kmap_* may need to take a look
at the PTE or lookup the cmap because it also does a private mapping, but that
would get quite expensive and kmap is time critical. My gut feeling
would be to just ignore this case for now - it could only happen e.g. when
someone does write(fd, mmapfromhardware, ...) and that could be
left undefined.

> a side effect of having cmap_request_range being part of change_page_attr is that the caller must first setup the page tables for the virtual mapping, then check if the caching is ok. it would seem preferrable to use cmap_request_range to check if the caching is ok, setup the page tables, then modify the page tables via change_page_attr (for kernel pages).

True.

>
> I haven't looked too closely yet, but it would seem like all of this could be encapsulated in remap_page_range and ioremap (ie, driver writers would never call cmap_ routines directly). I'll spend a day or two looking into how well this works.

Yes.

But there should be more low level calls too for the few users who cannot
use these high level functions for some reason.

Quick review:

set_pat: shouldn't the unknown CPU case error out?

pageattr: i think it should work with PFNs, otherwise you cannot
handle >4GB on i386. while IO mappings are normally <4GB
this will be needed for memory and in theory some mapping
could be above 4GB.

i'm not sure what lookup_phys_address is good for. Why don't
you just use __pa ? Or just pass the physical address ?

+ /* I think this is wrong, i/o region is above highmem? */
+ if (pfn_valid(pfn) && (pfn_to_page(pfn) >= highmem_start_page))

when it doesn't handle cmaps it should just return for any
memory >= high_memory

+ if ((c_tmp->end + 1) & 0xfff) BUG();
better use PAGE_MASK instead of hardcoding.

-Andi