2015-04-13 17:49:47

by Luis Chamberlain

[permalink] [raw]
Subject: Re: ioremap_uc() followed by set_memory_wc() - burrying MTRR

Cc'ing a few others as we ended up talking about the cruxes of my
unposted v2 series as I confirmed that set_memor_wc() would not work
as an alternative to my originally proposed __arch_phys_wc_add() to
force MTRR as a last resort on a few set of last remaining drivers.
This also discusses overlapping ioremap() calls and what we'd need
for a default shift from UC- to strong UC.

On Fri, Apr 10, 2015 at 11:25:22PM -0700, Andy Lutomirski wrote:
> On Apr 10, 2015 6:29 PM, "Luis R. Rodriguez" <[email protected]> wrote:
> >
> > On Fri, Apr 10, 2015 at 02:22:51PM -0700, Andy Lutomirski wrote:
> > > On Fri, Apr 10, 2015 at 1:58 PM, Toshi Kani <[email protected]> wrote:
> > > > On Fri, 2015-04-10 at 23:05 +0200, Luis R. Rodriguez wrote:
> > > >> On Fri, Apr 10, 2015 at 01:49:39PM -0600, Toshi Kani wrote:
> > > >> > On Fri, 2015-04-10 at 12:34 -0700, Luis R. Rodriguez wrote:
> > > >> > > On Fri, Apr 10, 2015 at 12:14 PM, Andy Lutomirski <[email protected]> wrote:
> > > >> > > > On Fri, Apr 10, 2015 at 10:17 AM, Luis R. Rodriguez <[email protected]> wrote:
> > > >> > > >>
> > > >> > > >> Andy, I'm ready to post a v2 series based on review of the first iteration of
> > > >> > > >> the bury-MTRR series however I ran into a snag in vetting for the ioremap_uc()
> > > >> > > >> followed by a set_memory_wc() strategy as a measure to both avoid when possible
> > > >> > > >> overlapping ioremap*() calls and/or to avoid the power of 2 MTRR size
> > > >> > > >> implications on having to use multiple MTRRs.
> > > >> > > >>
> > > >> > > >> I tested the strategy by just making my thinkpad's i915 driver use iremap_uc()
> > > >> > > >> which I add, and then use set_memory_wc(). To start off with I should note
> > > >> > > >> set_memory_*() helpers are x86 specific right now, this is not a problem for
> > > >> > > >> the series but its worth noting as we're replacing MTRR strategies which
> > > >> > > >> are x86 specific, but I am having issues getting set_memory_wc() take effect
> > > >> > > >> on my intel graphics card.
> > > >> > > >>
> > > >> > > >> I've reviewed if this is OK in code and I see no issues other than set_memory_*()
> > > >> > > >> helpers seem to be desirable for RAM, not IO memory, so was wondering if we need
> > > >> > > >> to add other helpers which can address IO memory or if this should work? The diff
> > > >> > > >> for the drivers is below, the actual commit for adding ioremap_uc() folllows
> > > >> > > >> with its commit log. Feedback / review on both is welcome as well as if you
> > > >> > > >> could help me figure out why this test-patch for i915 fails.
> > > >> > > >
> > > >> > > > I think they should work for IO memory, but I'm not really an authority here.
> > > >> > > >
> > > >> > > > Adding some people who have looked at the code recently.
> > > >> > >
> > > >> > > I was avoiding reviewing the cpa code but since this failed I just had
> > > >> > > to review it and I see nothing prevent it from being used on IO memory
> > > >> > > but -- memtype_rb_check_conflict() does prevent an overlap to be set
> > > >> > > on an *existing range* -- and since ioremap_uc() was used earlier the
> > > >> > > first reserve_memtype() with _PAGE_CACHE_MODE_WC by set_memory_wc() I
> > > >> > > believe should fail then. Please correct me if I'm wrong, I don't see
> > > >> > > the "conflicting memory types" print though, so not sure if it was
> > > >> > > because of that.
> > > >> > >
> > > >> > > I only started looking at this though but shouldn't this also mean we
> > > >> > > can't use overlapping ioremap() calls too? I thought that worked,
> > > >> > > because at least some drivers are using that strategy.
> > > >> >
> > > >> > set_memory_*() does not work with I/O memory ranges with the following
> > > >> > reasons:
> > > >> >
> > > >> > 1. __pa(addr) returns a fake address since there is no direct map.
> > > >> > 2. reserve_memtype() tracks regular memory and I/O memory differently.
> > > >> > For regular memory, set_memory_*() can modify WB with a new type since
> > > >> > reserve_memtype() does not track WB. For I/O memory, reserve_memtype()
> > > >> > detects a conflict when a given type is different from a tracked type.
> > > >>
> > > >> Interesting, but I also just realized I had messed up my test patch too,
> > > >> I checked for (!ret) instead of (ret). This works now.
> > > >>
> > > >> > > diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
> > > >> > > index dccdc8a..dd9501b 100644
> > > >> > > --- a/drivers/gpu/drm/i915/i915_gem_gtt.c
> > > >> > > +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
> > > >> > > @@ -1958,12 +1958,22 @@ static int ggtt_probe_common(struct drm_device *dev,
> > > >> > > gtt_phys_addr = pci_resource_start(dev->pdev, 0) +
> > > >> > > (pci_resource_len(dev->pdev, 0) / 2);
> > > >> > >
> > > >> > > - dev_priv->gtt.gsm = ioremap_wc(gtt_phys_addr, gtt_size);
> > > >> > > + dev_priv->gtt.gsm = ioremap_uc(gtt_phys_addr, gtt_size);
> > > >> > > if (!dev_priv->gtt.gsm) {
> > > >> > > DRM_ERROR("Failed to map the gtt page table\n");
> > > >> > > return -ENOMEM;
> > > >> > > }
> > > >> > >
> > > >> > > + printk("mcgrof:set_memory_wc() ggtt_probe_common()\n");
> > > >> > > +
> > > >> > > + ret = set_memory_wc((unsigned long) dev_priv->gtt.gsm, gtt_size >> PAGE_SHIFT);
> > > >> > > + if (!ret) {
> > > >>
> > > >> Mess up here.
> > > >>
> > > >> > > + DRM_ERROR("mcgrof: failed set_memory_wc()\n");
> > > >> > > + iounmap(dev_priv->gtt.gsm);
> > > >> > > + return ret;
> > > >> > > + }
> > > >> > > +
> > > >> > > +
> > > >> > > ret = setup_scratch_page(dev);
> > > >> > > if (ret) {
> > > >> > > DRM_ERROR("Scratch setup failed\n");
> > > >>
> > > >> as I read the code though reserve_memtype() should not allow for a change
> > > >> of an existing type as you note though (and therefore prevent also
> > > >> overlapping ioremap() calls on PAT), but since this is going through now
> > > >> I am not sure at if this is by chance somehow... ?
> > > >
> > > > Unless you fixed issue #1 above, set_memory_wc() passes a bogus address
> > > > to reserve_memtype(). Hence, it won't cause any conflict.
> > >
> > > Presumably fixing #1 would be okay (slow_virt_to_phys or whatever)
> > > since this stuff is already slow.
> >
> > That wouldn't cut it I think. That would still fail as the types
> > don't match. That means we would have to free_memtype() and
> > then alloc a new reserve_memtype(), but I don't think that's
> > enought still. For instance a new ioremap() does:
> >
> > 1) sanity checks on regions of memory
> > 2) get_vm_area_caller() for the range
> > 3) kernel_map_sync_memtype()
> > 4) ioremap_page_range()
> > 5) mmiotrace_ioremap()
> >
> > Correct me if I'm wrong but I think we'd need to do all this as well.
> > If so we'd be cutting and splicing an ioremap() range. This code would
> > seem tricky to get right. Maybe its best to not support this then and
> > simply have to live with having drivers do their own splititn gof
> > ioremap() calls.
> >
> > This *might* implicate some constraints on burrying MTRR though so...
> > please let me know what you think.
>
> To throw out more ideas, what if the drivers instead did their own
> get_vm_area vmap calls and then mapped the two consecutive regions
> separately at consecutive addresses?

Instead of having an API do it for them? They'd have to implement quite a bit.
Driver writers also tend to not have the best of judgement and *typically* just
copy and paste code, so we'd have to either annotate it well as a work around
for a few drivers (in my series there are 3 that need this in order for us to
bury MTRR) and not share the code or provide an API for these cases.

Note that I belive these findings should also mean, if I understand things
correctly, that when PAT is used overlapping ioremap*() calls may work with
the caveat that the type of cache used will depend on the offset used, one
cannot hope that using either offset will yield the same caching effects.
There might even be other caveats with this but that will depend on the
hardware and not sure what those effects are. The reason is that for the
the cache effects requiring the appropriate offset is reserve_memtype() on IO
memory will use a fake address and this can in turn mean that even though the
same "pa" address was used we could end up with it not really checking
conflicts with the original reserve_memtype(). Some parts of this complexity
were described above. I only saw a few drivers using overlapping ioremap*()
calls though on my MTRR review and they are all old devices so likely mostly
used on non-PAT systems, but there might be other corner cases elsewhere.

Lets recap, as I see it we have a few options with all this in mind on our
quest to bury MTRR (and later make strong UC default):

1) Let drivers do their own get_vm_area() calls as you note and handle the
cut and splicing of ioremap areas

2) Provide an API to split and splice ioremap ranges

3) Try to avoid these types of situations and let drivers simply try to
work towards a proper clean non-overlapping different ioremap() ranges

Let me know if you think of others but please keep in mind the UC- to strong
UC converstion we want to do later as well. We have ruled out using
set_memor_wc() now.

I prefer option 3), its technically possible but only for *new* drivers
and we'd need either some hard work to split the ioremap() areas or provide
a a set of *transient* APIs as I had originally proposed to phase / hope for
final transition. There are 3 drivers to address:

a) atyfb: fortunately I believe I have finished the split for the atyfb
driver, ioremap_uc() would be used on only the MMIO region while ioremap_wc()
on the framebuffer (with newly corrected fixed size). This would go in as
an example of what work was required to do a split.

b) ipath: while a split on ipath is possible the changes are quite
significant. Apart from changing the driver to use different offset bases
in different regions the driver also has a debug userspace filemap for
the entire register map, the code there would need to be modified to
use the right virtual address base depending on the virtual address accessed.
The qib driver already has logic which could be mimic'd for this fortunatley,
but still - this is considerable work. One side hack I was hoping for
was that overlapping ioremap*() calls with different page attribute
types would work even if the 2nd returned __iomem address was not used,
based on my review and Toshi's feedback on reserve_memtype() this would
_not work_, only using the right virtual address would ensure the right
caching technique is used. In fact since this is not really all clear
I would not be surprised if there are other caveats. Sticking to the original
__arch_phys_wc_add() to force MTRR use here might be best but note that
even if this is done an eventual change from ioremap_nocache() to default
from UC- to strong UC would cause a regression on the desired WC area, in
this case on the PIO buffers. We may need an API to keep UC- for drivers that
know that need it -- or we may need to really do tha hard work to try to
convert this driver to split the iorenmap*() areas. If we do not want to do the
work to split this driver's ioremap*() space we could live with having
__arch_phys_wc_add() and another API to force UC- even if later the default is
strong UC.

c) ivtv: the driver does not have the PCI space mapped out separately, and
in fact it actually does not do the math for the framebuffer, instead it lets
the device's own CPU do that and assume where its at, see
ivtvfb_get_framebuffer() and CX2341X_OSD_GET_FRAMEBUFFER, it has a get
but not a setter. Its not clear if the firmware would make a split easy.
We'd need ioremap_ucminus() here too and __arch_phys_wc_add().

>From the beginning it seems only framebuffer devices used MTRR/WC, lately it
seems infiniband drivers also find good use for for it for PIO TX buffers to
blast some sort of data, in the future I would not be surprised if other
devices found use for it. It may be true that the existing drivers that
requires the above type of work are corner cases -- but I wouldn't hold my
breath for that. The ivtv device is a good example of the worst type of
situations and these days. So perhap __arch_phys_wc_add() and a
ioremap_ucminus() might be something more than transient unless hardware folks
get a good memo or already know how to just Do The Right Thing (TM).

Luis


2015-04-16 00:04:46

by Andy Walls

[permalink] [raw]
Subject: Re: ioremap_uc() followed by set_memory_wc() - burrying MTRR

Hi All,

On Mon, 2015-04-13 at 19:49 +0200, Luis R. Rodriguez wrote:
[snip]
> I only saw a few drivers using overlapping ioremap*()
> calls though on my MTRR review and they are all old devices so likely mostly
> used on non-PAT systems, but there might be other corner cases elsewhere.
>
> Lets recap, as I see it we have a few options with all this in mind on our
> quest to bury MTRR (and later make strong UC default):
>
> 1) Let drivers do their own get_vm_area() calls as you note and handle the
> cut and splicing of ioremap areas
>
> 2) Provide an API to split and splice ioremap ranges
>
> 3) Try to avoid these types of situations and let drivers simply try to
> work towards a proper clean non-overlapping different ioremap() ranges
>
> Let me know if you think of others but please keep in mind the UC- to strong
> UC converstion we want to do later as well. We have ruled out using
> set_memor_wc() now.
>
> I prefer option 3), its technically possible but only for *new* drivers
> and we'd need either some hard work to split the ioremap() areas or provide
> a a set of *transient* APIs as I had originally proposed to phase / hope for
> final transition. There are 3 drivers to address:
>
> [snip]

Just some background for folks:
The ivtv driver supports cards that perform Standard Definition PAL,
NTSC, and SECAM TV capture + hardware MPEG-2 encoding and MPEG-2
decoding + TV output.

Of the supported cards only *one* card type, the PVR-350 based on the
CX23415 chip, can perform the MPEG-2 or YUV video decoding and output,
with an OSD overlay. PVR-350's are legacy PCI cards that have been end
of life since 2088 or earlier. Despite that, there are still used units
available on Amazon and eBay.

The ivtvfb driver module is an *optional* companion driver module that
provides a framebuffer interface for the user to output an X display, FB
console, or whatever to a standard definition analog PAL, NTSC, or SECAM
TV or VCR. It does this by co-opting the OSD overlay of the video
decoding output engine and having it take up the whole screen.



> c) ivtv: the driver does not have the PCI space mapped out separately, and
> in fact it actually does not do the math for the framebuffer, instead it lets
> the device's own CPU do that and assume where its at, see
> ivtvfb_get_framebuffer() and CX2341X_OSD_GET_FRAMEBUFFER, it has a get
> but not a setter. Its not clear if the firmware would make a split easy.

The CX2341[56]'s firmware + hardware machine are notorious for bugs and
being hard to work with. It would be difficult to determine with any
certainty that the firmware returned predictable OSD framebuffer
addresses from one user's system to the next.


> We'd need ioremap_ucminus() here too and __arch_phys_wc_add().

Yes.

As a driver writer/maintainer, IMO the name ioremap_ucminus() is jargon,
without a clear meaning from the name, and maybe too x86 PAT specific?
The pat.txt file under Documentation didn't explain what UC- meant; I
had to go searching old LKML emails to understand it was a unchached
region that could be overridden with write combine regions.

To me names along, these lines would be better:
ioremap_nocache_weak(), or
ioremap_nocache_mutable(), or
ioremap_nocache() (with ioremap_nocache_strong() or something for
the UC version)


> From the beginning it seems only framebuffer devices used MTRR/WC,
[snip]
> The ivtv device is a good example of the worst type of
> situations and these days. So perhap __arch_phys_wc_add() and a
> ioremap_ucminus() might be something more than transient unless hardware folks
> get a good memo or already know how to just Do The Right Thing (TM).

Just to reiterate a subtle point, use of the ivtvfb is *optional*. A
user may or may not load it. When the user does load the ivtvfb driver,
the ivtv driver has already been initialized and may have functions of
the card already in use by userspace.

Hopefully no one is trying to use the OSD as framebuffer and the video
decoder/output engine for video display at the same time. But the video
decoder/output device nodes may already be open for performing ioctl()
functions so unmapping the decoder IO space out from under them, when
loading the ivtvfb driver module, might not be a good thing.

Regards,
Andy

2015-04-15 23:43:17

by Andy Lutomirski

[permalink] [raw]
Subject: Re: ioremap_uc() followed by set_memory_wc() - burrying MTRR

On Wed, Apr 15, 2015 at 3:38 PM, Andy Walls <[email protected]> wrote:
> Hi All,
>
> On Mon, 2015-04-13 at 19:49 +0200, Luis R. Rodriguez wrote:
> [snip]
>> I only saw a few drivers using overlapping ioremap*()
>> calls though on my MTRR review and they are all old devices so likely mostly
>> used on non-PAT systems, but there might be other corner cases elsewhere.
>>
>> Lets recap, as I see it we have a few options with all this in mind on our
>> quest to bury MTRR (and later make strong UC default):
>>
>> 1) Let drivers do their own get_vm_area() calls as you note and handle the
>> cut and splicing of ioremap areas
>>
>> 2) Provide an API to split and splice ioremap ranges
>>
>> 3) Try to avoid these types of situations and let drivers simply try to
>> work towards a proper clean non-overlapping different ioremap() ranges
>>
>> Let me know if you think of others but please keep in mind the UC- to strong
>> UC converstion we want to do later as well. We have ruled out using
>> set_memor_wc() now.
>>
>> I prefer option 3), its technically possible but only for *new* drivers
>> and we'd need either some hard work to split the ioremap() areas or provide
>> a a set of *transient* APIs as I had originally proposed to phase / hope for
>> final transition. There are 3 drivers to address:
>>
>> [snip]
>
> Just some background for folks:
> The ivtv driver supports cards that perform Standard Definition PAL,
> NTSC, and SECAM TV capture + hardware MPEG-2 encoding and MPEG-2
> decoding + TV output.
>
> Of the supported cards only *one* card type, the PVR-350 based on the
> CX23415 chip, can perform the MPEG-2 or YUV video decoding and output,
> with an OSD overlay. PVR-350's are legacy PCI cards that have been end
> of life since 2088 or earlier. Despite that, there are still used units
> available on Amazon and eBay.
>
> The ivtvfb driver module is an *optional* companion driver module that
> provides a framebuffer interface for the user to output an X display, FB
> console, or whatever to a standard definition analog PAL, NTSC, or SECAM
> TV or VCR. It does this by co-opting the OSD overlay of the video
> decoding output engine and having it take up the whole screen.
>
>
>
>> c) ivtv: the driver does not have the PCI space mapped out separately, and
>> in fact it actually does not do the math for the framebuffer, instead it lets
>> the device's own CPU do that and assume where its at, see
>> ivtvfb_get_framebuffer() and CX2341X_OSD_GET_FRAMEBUFFER, it has a get
>> but not a setter. Its not clear if the firmware would make a split easy.
>
> The CX2341[56]'s firmware + hardware machine are notorious for bugs and
> being hard to work with. It would be difficult to determine with any
> certainty that the firmware returned predictable OSD framebuffer
> addresses from one user's system to the next.
>
>
>> We'd need ioremap_ucminus() here too and __arch_phys_wc_add().
>
> Yes.
>
> As a driver writer/maintainer, IMO the name ioremap_ucminus() is jargon,
> without a clear meaning from the name, and maybe too x86 PAT specific?
> The pat.txt file under Documentation didn't explain what UC- meant; I
> had to go searching old LKML emails to understand it was a unchached
> region that could be overridden with write combine regions.
>
> To me names along, these lines would be better:
> ioremap_nocache_weak(), or
> ioremap_nocache_mutable(), or
> ioremap_nocache() (with ioremap_nocache_strong() or something for
> the UC version)
>

IMO the right solution would be to avoid ioremapping the whole bar at
startup. Instead ioremap pieces once the driver learns what they are.
This wouldn't have any of these problems -- you'd ioremap() register
regions and you'd ioremap_wc() the framebuffer once you find it. If
there are regions of unknown purpose, just don't map them all.

Would this be feasible?

--Andy

2015-04-15 23:58:29

by Luis Chamberlain

[permalink] [raw]
Subject: Re: ioremap_uc() followed by set_memory_wc() - burrying MTRR

Hey Andy, thanks for your review, adding Hyong-Youb Kim for review of the
full range ioremap_wc() idea below.

On Wed, Apr 15, 2015 at 06:38:51PM -0400, Andy Walls wrote:
> Hi All,
>
> On Mon, 2015-04-13 at 19:49 +0200, Luis R. Rodriguez wrote:
> > From the beginning it seems only framebuffer devices used MTRR/WC,
> [snip]
> > The ivtv device is a good example of the worst type of
> > situations and these days. So perhap __arch_phys_wc_add() and a
> > ioremap_ucminus() might be something more than transient unless hardware folks
> > get a good memo or already know how to just Do The Right Thing (TM).
>
> Just to reiterate a subtle point, use of the ivtvfb is *optional*. A
> user may or may not load it. When the user does load the ivtvfb driver,
> the ivtv driver has already been initialized and may have functions of
> the card already in use by userspace.

I suspected this and its why I note that a rewrite to address a clean
split with separate ioremap seems rather difficult in this case.

> Hopefully no one is trying to use the OSD as framebuffer and the video
> decoder/output engine for video display at the same time.

Worst case concern I have also is the implications of having overlapping
ioremap() calls (as proposed in my last reply) for different memory types
and having the different virtual memory addresse used by different parts
of the driver. Its not clear to me what the hardware implications of this
are.

> But the video
> decoder/output device nodes may already be open for performing ioctl()
> functions so unmapping the decoder IO space out from under them, when
> loading the ivtvfb driver module, might not be a good thing.

Using overlapping ioremap() calls with different memory types would address
this concern provided hardware won't barf both on the device and CPU. Hardware
folks could provide feedback or an ivtvfb user could test the patch supplied
on both non-PAT and PAT systems. Even so, who knows, this might work on some
systems while not on others, only hardware folks would know.

An alternative... is to just ioremap_wc() the entire region, including
MMIO registers for these old devices. I see one ethernet driver that does
this, myri10ge, and am curious how and why they ended up deciding this
and if they have run into any issues. I wonder if this is a reasonable
comrpomise for these 2 remaining corner cases.

Luis

2015-04-16 00:53:55

by Andy Walls

[permalink] [raw]
Subject: Re: ioremap_uc() followed by set_memory_wc() - burrying MTRR

On Wed, 2015-04-15 at 16:42 -0700, Andy Lutomirski wrote:
> On Wed, Apr 15, 2015 at 3:38 PM, Andy Walls <[email protected]> wrote:

> >
>
> IMO the right solution would be to avoid ioremapping the whole bar at
> startup. Instead ioremap pieces once the driver learns what they are.
> This wouldn't have any of these problems -- you'd ioremap() register
> regions and you'd ioremap_wc() the framebuffer once you find it. If
> there are regions of unknown purpose, just don't map them all.
>
> Would this be feasible?

Feasible? Maybe.

Worth the time and effort for end-of-life, convential PCI hardware so I
can have an optimally performing X display on a Standard Def Analog TV
screen? Nope. I don't have that level of nostalgia.


We sort of know where some things are in the MMIO space due to
experimentation and past efforts examining the firmware binary.

Documentation/video4linux/cx2341x/fw-*.txt documents some things. The
driver code actually codifies a little bit more knowledge.

The driver code for doing transfers between host and card is complex and
fragile with some streams that use DMA, other streams that use PIO,
digging VBI data straight out of card memory, and scatter-gather being
broken on newer firmwares. Playing around with ioremapping will be hard
to get right and likely cause something in the code to break for the
primary use case of the ivtv supported cards.

Regards,
Andy

2015-04-16 00:58:44

by Andy Lutomirski

[permalink] [raw]
Subject: Re: ioremap_uc() followed by set_memory_wc() - burrying MTRR

On Wed, Apr 15, 2015 at 4:59 PM, Andy Walls <[email protected]> wrote:
> On Wed, 2015-04-15 at 16:42 -0700, Andy Lutomirski wrote:
>> On Wed, Apr 15, 2015 at 3:38 PM, Andy Walls <[email protected]> wrote:
>
>> >
>>
>> IMO the right solution would be to avoid ioremapping the whole bar at
>> startup. Instead ioremap pieces once the driver learns what they are.
>> This wouldn't have any of these problems -- you'd ioremap() register
>> regions and you'd ioremap_wc() the framebuffer once you find it. If
>> there are regions of unknown purpose, just don't map them all.
>>
>> Would this be feasible?
>
> Feasible? Maybe.
>
> Worth the time and effort for end-of-life, convential PCI hardware so I
> can have an optimally performing X display on a Standard Def Analog TV
> screen? Nope. I don't have that level of nostalgia.
>

The point is actually to let us unexport or delete mtrr_add. We can
either severely regress performance on ivtv on PAT-capable hardware if
we naively switch it to arch_phys_wc_add or we can do something else.
The something else remains to be determined.

>
> We sort of know where some things are in the MMIO space due to
> experimentation and past efforts examining the firmware binary.
>
> Documentation/video4linux/cx2341x/fw-*.txt documents some things. The
> driver code actually codifies a little bit more knowledge.
>
> The driver code for doing transfers between host and card is complex and
> fragile with some streams that use DMA, other streams that use PIO,
> digging VBI data straight out of card memory, and scatter-gather being
> broken on newer firmwares. Playing around with ioremapping will be hard
> to get right and likely cause something in the code to break for the
> primary use case of the ivtv supported cards.

Ick.

If the only thing that really wants WC is the esoteric framebuffer
thing, could we just switch to arch_phys_wc_add and assume that no one
will care about the regression on new CPUs with ivtv cards?


--Andy

2015-04-16 01:55:32

by Andy Walls

[permalink] [raw]
Subject: Re: ioremap_uc() followed by set_memory_wc() - burrying MTRR

On Wed, 2015-04-15 at 17:58 -0700, Andy Lutomirski wrote:
> On Wed, Apr 15, 2015 at 4:59 PM, Andy Walls <[email protected]> wrote:
> > On Wed, 2015-04-15 at 16:42 -0700, Andy Lutomirski wrote:
> >> On Wed, Apr 15, 2015 at 3:38 PM, Andy Walls <[email protected]> wrote:
> >
> >> >
> >>
> >> IMO the right solution would be to avoid ioremapping the whole bar at
> >> startup. Instead ioremap pieces once the driver learns what they are.
> >> This wouldn't have any of these problems -- you'd ioremap() register
> >> regions and you'd ioremap_wc() the framebuffer once you find it. If
> >> there are regions of unknown purpose, just don't map them all.
> >>
> >> Would this be feasible?
> >
> > Feasible? Maybe.
> >
> > Worth the time and effort for end-of-life, convential PCI hardware so I
> > can have an optimally performing X display on a Standard Def Analog TV
> > screen? Nope. I don't have that level of nostalgia.
> >
>
> The point is actually to let us unexport or delete mtrr_add.

Understood.


> We can
> either severely regress performance on ivtv on PAT-capable hardware if
> we naively switch it to arch_phys_wc_add or we can do something else.
> The something else remains to be determined.

Maybe ioremap the decoder register area as UC, and ioremap the rest of
the decoder region to WC. (Does that suck up too many PAT resources?)
Then add PCI reads following any sort of singleton PCI writes in the WC
region. I assume PCI rules about write postings before reads still
apply when WC is set.

> >
> > We sort of know where some things are in the MMIO space due to
> > experimentation and past efforts examining the firmware binary.
> >
> > Documentation/video4linux/cx2341x/fw-*.txt documents some things. The
> > driver code actually codifies a little bit more knowledge.
> >
> > The driver code for doing transfers between host and card is complex and
> > fragile with some streams that use DMA, other streams that use PIO,
> > digging VBI data straight out of card memory, and scatter-gather being
> > broken on newer firmwares. Playing around with ioremapping will be hard
> > to get right and likely cause something in the code to break for the
> > primary use case of the ivtv supported cards.
>
> Ick.

Yeah.

> If the only thing that really wants WC is the esoteric framebuffer
> thing,

That appears to be it.

> could we just switch to arch_phys_wc_add and assume that no one
> will care about the regression on new CPUs with ivtv cards?

That's on the table in my mind. Not sure if it is the friendliest thing
to do to users. Quite honestly though, modern graphics cards have much
better ouput resolution and performance. Anyone with a modern system
really should be using one. (i.e. MythTV gave up on support for PVR-350
output for video playback years ago in May 2010.)


BTW, my 2005 system with multiple conventional PCI slots in it shows a
'pat' flag in /proc/cpuinfo. (AMD Athlon(tm) 64 X2 Dual Core Processor
4200+) I didn't know it was considered "new". :)

Regards,
Andy

2015-04-16 02:01:34

by Andy Walls

[permalink] [raw]
Subject: Re: ioremap_uc() followed by set_memory_wc() - burrying MTRR

On Thu, 2015-04-16 at 01:58 +0200, Luis R. Rodriguez wrote:
> Hey Andy, thanks for your review, adding Hyong-Youb Kim for review of the
> full range ioremap_wc() idea below.
>
> On Wed, Apr 15, 2015 at 06:38:51PM -0400, Andy Walls wrote:
> > Hi All,
> >
> > On Mon, 2015-04-13 at 19:49 +0200, Luis R. Rodriguez wrote:
> > > From the beginning it seems only framebuffer devices used MTRR/WC,
> > [snip]
> > > The ivtv device is a good example of the worst type of
> > > situations and these days. So perhap __arch_phys_wc_add() and a
> > > ioremap_ucminus() might be something more than transient unless hardware folks
> > > get a good memo or already know how to just Do The Right Thing (TM).
> >
> > Just to reiterate a subtle point, use of the ivtvfb is *optional*. A
> > user may or may not load it. When the user does load the ivtvfb driver,
> > the ivtv driver has already been initialized and may have functions of
> > the card already in use by userspace.
>
> I suspected this and its why I note that a rewrite to address a clean
> split with separate ioremap seems rather difficult in this case.
>
> > Hopefully no one is trying to use the OSD as framebuffer and the video
> > decoder/output engine for video display at the same time.
>
> Worst case concern I have also is the implications of having overlapping
> ioremap() calls (as proposed in my last reply) for different memory types
> and having the different virtual memory addresse used by different parts
> of the driver. Its not clear to me what the hardware implications of this
> are.
>
> > But the video
> > decoder/output device nodes may already be open for performing ioctl()
> > functions so unmapping the decoder IO space out from under them, when
> > loading the ivtvfb driver module, might not be a good thing.
>
> Using overlapping ioremap() calls with different memory types would address
> this concern provided hardware won't barf both on the device and CPU. Hardware
> folks could provide feedback or an ivtvfb user could test the patch supplied
> on both non-PAT and PAT systems. Even so, who knows, this might work on some
> systems while not on others, only hardware folks would know.

The CX2341[56] firmware+hardware has a track record for being really
picky about sytem hardware. It's primary symptoms are for the DMA
engine or Mailbox protocol to get hung up. So yeah, it could barf
easily on some users.

> An alternative... is to just ioremap_wc() the entire region, including
> MMIO registers for these old devices.

That's my thought; as long as implementing PCI write then read can force
writes to be posted and that setting that many pages as WC doesn't cause
some sort of PAT resource exhaustion. (I know very little about PAT).

> I see one ethernet driver that does
> this, myri10ge, and am curious how and why they ended up deciding this
> and if they have run into any issues. I wonder if this is a reasonable
> comrpomise for these 2 remaining corner cases.
>
> Luis

Regards,
Andy

2015-04-16 04:34:41

by Hyong-Youb Kim

[permalink] [raw]
Subject: Re: ioremap_uc() followed by set_memory_wc() - burrying MTRR

On Thu, Apr 16, 2015 at 01:58:16AM +0200, Luis R. Rodriguez wrote:
>
> An alternative... is to just ioremap_wc() the entire region, including
> MMIO registers for these old devices. I see one ethernet driver that does
> this, myri10ge, and am curious how and why they ended up deciding this
> and if they have run into any issues. I wonder if this is a reasonable
> comrpomise for these 2 remaining corner cases.
>

For myri10ge, it a performance thing. Descriptor rings are in NIC
memory BAR0, not in host memory. Say, to send a packet, the driver
writes the send descriptor to the ioremap'd NIC memory. It is a
multi-word descriptor. So, to send it as one PCIE MWr transaction,
the driver maps the whole BAR0 as WC and does "copy descriptor; wmb".

Without WC, descriptors would end up as multiple 4B or 8B MWr packets
to the NIC, which has a pretty big performance impact on this
particular NIC.

Most registers that do not want WC are actually in BAR2, which is not
mapped as WC. For registers that are in BAR0, we do "write to the
register; wmb". If we want to wait till the NIC has seen the write,
we do "write; wmb; read".

This approach has worked for this device for many years. I cannot say
whether it works for other devices, though.

2015-04-16 18:54:41

by Luis Chamberlain

[permalink] [raw]
Subject: Re: ioremap_uc() followed by set_memory_wc() - burrying MTRR

On Thu, Apr 16, 2015 at 01:18:37PM +0900, Hyong-Youb Kim wrote:
> On Thu, Apr 16, 2015 at 01:58:16AM +0200, Luis R. Rodriguez wrote:
> >
> > An alternative... is to just ioremap_wc() the entire region, including
> > MMIO registers for these old devices. I see one ethernet driver that does
> > this, myri10ge, and am curious how and why they ended up deciding this
> > and if they have run into any issues. I wonder if this is a reasonable
> > comrpomise for these 2 remaining corner cases.
> >
>
> For myri10ge, it a performance thing. Descriptor rings are in NIC
> memory BAR0, not in host memory. Say, to send a packet, the driver
> writes the send descriptor to the ioremap'd NIC memory. It is a
> multi-word descriptor. So, to send it as one PCIE MWr transaction,
> the driver maps the whole BAR0 as WC and does "copy descriptor; wmb".

Interesting, so you burst write multi-word descriptor writes using
write-combining here for the Ethernet device.

> Without WC, descriptors would end up as multiple 4B or 8B MWr packets
> to the NIC, which has a pretty big performance impact on this
> particular NIC.

How big are the descriptors?

> Most registers that do not want WC are actually in BAR2, which is not
> mapped as WC. For registers that are in BAR0, we do "write to the
> register; wmb". If we want to wait till the NIC has seen the write,
> we do "write; wmb; read".

Interesting, thanks, yeah using this as a work around to the problem sounds
plausible however it still would require likely making just as many changes to
the ivtv and ipath driver as to just do a proper split. I do wonder however if
this sort of work around can be generalized somehow though so that others could
use, if this sort of thing is going to become prevalent. If so then this would
serve two purposes: work around for the corner cases of MTRR use on Linux and
also these sorts of device constraints.

In order to determine if this is likely to be generally useful could you elaborate
a bit more about the detals of the performance issues of not bursting writes
for the descriptor on this device.

Even if that is done a conversion over to this work around seems it may require
device specific nitpicks. For instance I note in myri10ge_submit_req() for
small writes you just do a reverse write and do the first set last, then
finally the last 32 bits are rewritten, I guess to trigger something?

> This approach has worked for this device for many years. I cannot say
> whether it works for other devices, though.

I think it should but the more interesting question would be exactly
*why* it was needed for this device, who determined that, and why?

Luis

2015-04-16 19:19:36

by Andy Lutomirski

[permalink] [raw]
Subject: Re: ioremap_uc() followed by set_memory_wc() - burrying MTRR

On Apr 15, 2015 6:54 PM, "Andy Walls" <[email protected]> wrote:
>
> On Wed, 2015-04-15 at 17:58 -0700, Andy Lutomirski wrote:
> > On Wed, Apr 15, 2015 at 4:59 PM, Andy Walls <[email protected]> wrote:
> > > On Wed, 2015-04-15 at 16:42 -0700, Andy Lutomirski wrote:
> > >> On Wed, Apr 15, 2015 at 3:38 PM, Andy Walls <[email protected]> wrote:
> > >
> > >> >
> > >>
> > >> IMO the right solution would be to avoid ioremapping the whole bar at
> > >> startup. Instead ioremap pieces once the driver learns what they are.
> > >> This wouldn't have any of these problems -- you'd ioremap() register
> > >> regions and you'd ioremap_wc() the framebuffer once you find it. If
> > >> there are regions of unknown purpose, just don't map them all.
> > >>
> > >> Would this be feasible?
> > >
> > > Feasible? Maybe.
> > >
> > > Worth the time and effort for end-of-life, convential PCI hardware so I
> > > can have an optimally performing X display on a Standard Def Analog TV
> > > screen? Nope. I don't have that level of nostalgia.
> > >
> >
> > The point is actually to let us unexport or delete mtrr_add.
>
> Understood.
>
>
> > We can
> > either severely regress performance on ivtv on PAT-capable hardware if
> > we naively switch it to arch_phys_wc_add or we can do something else.
> > The something else remains to be determined.
>
> Maybe ioremap the decoder register area as UC, and ioremap the rest of
> the decoder region to WC. (Does that suck up too many PAT resources?

PAT resources are unlimited.

> Then add PCI reads following any sort of singleton PCI writes in the WC
> region. I assume PCI rules about write postings before reads still
> apply when WC is set.
>

I think we need sfence, too, but that's easy. We also lose the write
sizes. That is, adjacent writes get combined. Maybe that's okay.

> > >
> > > We sort of know where some things are in the MMIO space due to
> > > experimentation and past efforts examining the firmware binary.
> > >
> > > Documentation/video4linux/cx2341x/fw-*.txt documents some things. The
> > > driver code actually codifies a little bit more knowledge.
> > >
> > > The driver code for doing transfers between host and card is complex and
> > > fragile with some streams that use DMA, other streams that use PIO,
> > > digging VBI data straight out of card memory, and scatter-gather being
> > > broken on newer firmwares. Playing around with ioremapping will be hard
> > > to get right and likely cause something in the code to break for the
> > > primary use case of the ivtv supported cards.
> >
> > Ick.
>
> Yeah.
>
> > If the only thing that really wants WC is the esoteric framebuffer
> > thing,
>
> That appears to be it.
>
> > could we just switch to arch_phys_wc_add and assume that no one
> > will care about the regression on new CPUs with ivtv cards?
>
> That's on the table in my mind. Not sure if it is the friendliest thing
> to do to users. Quite honestly though, modern graphics cards have much
> better ouput resolution and performance. Anyone with a modern system
> really should be using one. (i.e. MythTV gave up on support for PVR-350
> output for video playback years ago in May 2010.)
>
>
> BTW, my 2005 system with multiple conventional PCI slots in it shows a
> 'pat' flag in /proc/cpuinfo. (AMD Athlon(tm) 64 X2 Dual Core Processor
> 4200+) I didn't know it was considered "new". :)

Tons of CPUs have that ability, but we often turn it off due to errata
on older CPUs.

--Andy

>
> Regards,
> Andy
>
>

2015-04-17 08:01:58

by Hyong-Youb Kim

[permalink] [raw]
Subject: Re: ioremap_uc() followed by set_memory_wc() - burrying MTRR

On Thu, Apr 16, 2015 at 08:54:26PM +0200, Luis R. Rodriguez wrote:
> > Without WC, descriptors would end up as multiple 4B or 8B MWr packets
> > to the NIC, which has a pretty big performance impact on this
> > particular NIC.
>
> How big are the descriptors?

Some are 64B (a batch of eight 8B descriptors). Some are 16B.

> > Most registers that do not want WC are actually in BAR2, which is not
> > mapped as WC. For registers that are in BAR0, we do "write to the
> > register; wmb". If we want to wait till the NIC has seen the write,
> > we do "write; wmb; read".
>
> Interesting, thanks, yeah using this as a work around to the problem sounds
> plausible however it still would require likely making just as many changes to
> the ivtv and ipath driver as to just do a proper split. I do wonder however if
> this sort of work around can be generalized somehow though so that others could
> use, if this sort of thing is going to become prevalent. If so then this would
> serve two purposes: work around for the corner cases of MTRR use on Linux and
> also these sorts of device constraints.

These Myricom devices are very non-standard in my opinion, at least in
the Ethernet world. Few, if any, other devices depend so much on WC
like these do. I think almost all devices now have rings in host
memory. The NIC pulls them via DMA. No need for WC, and no need to
special case registers...

> In order to determine if this is likely to be generally useful could you elaborate
> a bit more about the detals of the performance issues of not bursting writes
> for the descriptor on this device.

For this particular Myricom device, performance penalty stems from the
use of slow path in the firmware. They are not about how effectively
we use PCI Express or latency or bandwidth. Small MWr packets end up
casuing slow path processing via the firmware in this device.

There are HPC low latency NICs that use WC for different reasons. To
reduce latency as much as possible, some of these copy small packets
to the NIC memory via PIO (BAR0, and so on), instead of DMA. They
want WC mapping to minimize PCI Express packets/transactions.

I do not know about video adapters and their use for WC.

> Even if that is done a conversion over to this work around seems it may require
> device specific nitpicks. For instance I note in myri10ge_submit_req() for
> small writes you just do a reverse write and do the first set last, then
> finally the last 32 bits are rewritten, I guess to trigger something?

Right. The device polls the last word to start sending, DMA, etc.

> > This approach has worked for this device for many years. I cannot say
> > whether it works for other devices, though.
>
> I think it should but the more interesting question would be exactly
> *why* it was needed for this device, who determined that, and why?

Hopefully, the above text answers some of your questions.

2015-04-21 17:36:01

by Luis Chamberlain

[permalink] [raw]
Subject: Re: ioremap_uc() followed by set_memory_wc() - burrying MTRR

On Wed, Apr 15, 2015 at 05:58:14PM -0700, Andy Lutomirski wrote:
> On Wed, Apr 15, 2015 at 4:59 PM, Andy Walls <[email protected]> wrote:
> > On Wed, 2015-04-15 at 16:42 -0700, Andy Lutomirski wrote:
> >> On Wed, Apr 15, 2015 at 3:38 PM, Andy Walls <[email protected]> wrote:
> >
> >> >
> >>
> >> IMO the right solution would be to avoid ioremapping the whole bar at
> >> startup. Instead ioremap pieces once the driver learns what they are.
> >> This wouldn't have any of these problems -- you'd ioremap() register
> >> regions and you'd ioremap_wc() the framebuffer once you find it. If
> >> there are regions of unknown purpose, just don't map them all.
> >>
> >> Would this be feasible?
> >
> > Feasible? Maybe.
> >
> > Worth the time and effort for end-of-life, convential PCI hardware so I
> > can have an optimally performing X display on a Standard Def Analog TV
> > screen? Nope. I don't have that level of nostalgia.
> >
>
> The point is actually to let us unexport or delete mtrr_add. We can
> either severely regress performance on ivtv on PAT-capable hardware if
> we naively switch it to arch_phys_wc_add or we can do something else.
> The something else remains to be determined.

Back to square one: I can't come up with anything not too instrusive
or that dotes not requires substantial amount of work as an alternative to
removing MTRR completely right now (with the long term goal of also
making strong UC default) and its because of 2 device drivers:

* ivtv: firmware API is poo and device is legacy, no one cares
* ipath: driver needs a clean split and work is considerable,
maintainers have not been responsive, do they care?

What do we want to do with these drivers? Let us be straight shooters,
if we are serious about having a performance regression on the drivers
for the sake of removing MTRR why not just seriously discuss removal
of these drivers. This way we can remain sane, upkeep a policy to
never even consider overlapping ioremap*() calls, and have a clean
expected strategy we can expect for new drivers.

I'm going to split up my patches now into 4 series:

1) things which are straight forward in converting drivers over
to arch_phys_wc_add() and ioremap_wc(). These are subsystem
wide though, so just a heads up, my hope is that each subsystem
maintainer can take their own series unless someone else is
comfortable in taking this in for x86

2) a few helpers in the like of ioremap_wc() needed for other drivers.
These are straight forward but since they depend on x86 / core
helpers it would be nice for them to go I guess through x86 folks.
What maintainer is up to take these?

3) MTRR run time changes

4) corner cases - TBD - lets discuss here what we want to do with
ivtv and ipath. I will however remove fusion's mtrr code use
as its all commented out.

Luis

2015-04-21 22:02:29

by Luis Chamberlain

[permalink] [raw]
Subject: Re: ioremap_uc() followed by set_memory_wc() - burrying MTRR

On Wed, Apr 15, 2015 at 09:07:37PM -0400, Andy Walls wrote:
> On Thu, 2015-04-16 at 01:58 +0200, Luis R. Rodriguez wrote:
> > Hey Andy, thanks for your review, adding Hyong-Youb Kim for review of the
> > full range ioremap_wc() idea below.
> >
> > On Wed, Apr 15, 2015 at 06:38:51PM -0400, Andy Walls wrote:
> > > Hi All,
> > >
> > > On Mon, 2015-04-13 at 19:49 +0200, Luis R. Rodriguez wrote:
> > > > From the beginning it seems only framebuffer devices used MTRR/WC,
> > > [snip]
> > > > The ivtv device is a good example of the worst type of
> > > > situations and these days. So perhap __arch_phys_wc_add() and a
> > > > ioremap_ucminus() might be something more than transient unless hardware folks
> > > > get a good memo or already know how to just Do The Right Thing (TM).
> > >
> > > Just to reiterate a subtle point, use of the ivtvfb is *optional*. A
> > > user may or may not load it. When the user does load the ivtvfb driver,
> > > the ivtv driver has already been initialized and may have functions of
> > > the card already in use by userspace.
> >
> > I suspected this and its why I note that a rewrite to address a clean
> > split with separate ioremap seems rather difficult in this case.
> >
> > > Hopefully no one is trying to use the OSD as framebuffer and the video
> > > decoder/output engine for video display at the same time.
> >
> > Worst case concern I have also is the implications of having overlapping
> > ioremap() calls (as proposed in my last reply) for different memory types
> > and having the different virtual memory addresse used by different parts
> > of the driver. Its not clear to me what the hardware implications of this
> > are.
> >
> > > But the video
> > > decoder/output device nodes may already be open for performing ioctl()
> > > functions so unmapping the decoder IO space out from under them, when
> > > loading the ivtvfb driver module, might not be a good thing.
> >
> > Using overlapping ioremap() calls with different memory types would address
> > this concern provided hardware won't barf both on the device and CPU. Hardware
> > folks could provide feedback or an ivtvfb user could test the patch supplied
> > on both non-PAT and PAT systems. Even so, who knows, this might work on some
> > systems while not on others, only hardware folks would know.
>
> The CX2341[56] firmware+hardware has a track record for being really
> picky about sytem hardware. It's primary symptoms are for the DMA
> engine or Mailbox protocol to get hung up. So yeah, it could barf
> easily on some users.
>
> > An alternative... is to just ioremap_wc() the entire region, including
> > MMIO registers for these old devices.
>
> That's my thought; as long as implementing PCI write then read can force
> writes to be posted and that setting that many pages as WC doesn't cause
> some sort of PAT resource exhaustion. (I know very little about PAT).

So upon review that strategy won't work well unless we implemnt some
sort of of hack on the driver. That's also quite a bit of work.

Andy, can we live without MTRR support on this driver for future kernels? This
would only leave ipath as the only offending driver.

Luis

2015-04-21 22:08:34

by Luis Chamberlain

[permalink] [raw]
Subject: Re: ioremap_uc() followed by set_memory_wc() - burrying MTRR

On Tue, Apr 21, 2015 at 3:02 PM, Luis R. Rodriguez <[email protected]> wrote:
> Andy, can we live without MTRR support on this driver for future kernels? This
> would only leave ipath as the only offending driver.

Sorry to be clear, can we live with removal of write-combining on this driver?

Luis

2015-04-21 22:09:31

by Andy Lutomirski

[permalink] [raw]
Subject: Re: ioremap_uc() followed by set_memory_wc() - burrying MTRR

On Tue, Apr 21, 2015 at 3:08 PM, Luis R. Rodriguez <[email protected]> wrote:
> On Tue, Apr 21, 2015 at 3:02 PM, Luis R. Rodriguez <[email protected]> wrote:
>> Andy, can we live without MTRR support on this driver for future kernels? This
>> would only leave ipath as the only offending driver.
>
> Sorry to be clear, can we live with removal of write-combining on this driver?
>

I personally think so, but a driver maintainer's ack would be nice.

--Andy