|   |   |   |
|---|---|---|
| author | Michael Ellerman <mpe@ellerman.id.au> | 2015-04-14 09:29:23 +1000 |
| committer | Michael Ellerman <mpe@ellerman.id.au> | 2015-04-14 09:29:23 +1000 |
| commit | ad30cb9946515f72af5c3e89ad9de18870c1a1e7 (patch) | |
| tree | b4c9b385e18ae8128da1522055a17b453f2309fc /Documentation/powerpc | |
| parent | b0a478ede669949682b9c698f6146c0065543b91 (diff) | |
| parent | d4ed11aa4881246e1e36e0189f30f053f140370c (diff) | |
| download | linux-ad30cb9946515f72af5c3e89ad9de18870c1a1e7.tar.bz2 | |
Merge branch 'next-sriov' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc into next
Merge Richard's work to support SR-IOV on PowerNV. All generic PCI
patches acked by Bjorn.
Some minor conflicts with Daniel's pci_controller_ops work.
Conflicts:
	arch/powerpc/include/asm/machdep.h
	arch/powerpc/platforms/powernv/pci-ioda.c
Diffstat (limited to 'Documentation/powerpc')
| -rw-r--r-- | Documentation/powerpc/pci_iov_resource_on_powernv.txt | 301 | 
1 file changed, 301 insertions, 0 deletions
diff --git a/Documentation/powerpc/pci_iov_resource_on_powernv.txt b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
new file mode 100644
index 000000000000..b55c5cd83f8d
--- /dev/null
+++ b/Documentation/powerpc/pci_iov_resource_on_powernv.txt
@@ -0,0 +1,301 @@

Wei Yang <weiyang@linux.vnet.ibm.com>
Benjamin Herrenschmidt <benh@au1.ibm.com>
Bjorn Helgaas <bhelgaas@google.com>
26 Aug 2014

This document describes the hardware requirements for PCI MMIO resource
sizing and assignment on PowerKVM and how the generic PCI code handles
them.  The first two sections describe the concept of Partitionable
Endpoints and its implementation on P8 (IODA2).  The last two sections
discuss the considerations for enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs, etc.) and to provide a mechanism
to freeze a device that is causing errors in order to limit the possibility
of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of "frozen"
state bits (one for MMIO and one for DMA; they get set together but can be
cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return all 1's.  MSIs are also blocked.  There's a bit more state that
captures things like the details of the error that caused the freeze, but
that's not critical.

The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2).  Keep in mind that this is all per PHB (PCI host bridge).  Each
PHB is a completely separate HW entity that replicates the entire logic,
so it has its own set of PEs, etc.

2. Implementation of Partitionable Endpoints on P8 (IODA2)

P8 supports up to 256 Partitionable Endpoints per PHB.

  * Inbound

    For DMA, MSIs and inbound PCIe error messages, we have a table (in
    memory, but accessed in HW by the chip) that provides a direct
    correspondence between a PCIe RID (bus/dev/fn) and a PE number.
    We call this the RTT.

    - For DMA we then provide an entire address space for each PE that can
      contain two "windows", depending on the value of PCI address bit 59.
      Each window can be configured to be remapped via a "TCE table"
      (IOMMU translation table), which has various configurable
      characteristics not described here.

    - For MSIs, we have two windows in the address space (one at the top
      of the 32-bit space and one much higher) which, via a combination of
      the address and MSI value, will result in one of the 2048 interrupts
      per bridge being triggered.  There's a PE# in the interrupt
      controller descriptor table as well, which is compared with the PE#
      obtained from the RTT to "authorize" the device to emit that
      specific interrupt.

    - Error messages just use the RTT.
  * Outbound.  That's where the tricky part is.

    Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
    from the CPU address space to the PCI address space.  There is one M32
    window and sixteen M64 windows.  They have different characteristics.

    First, what they have in common: they forward a configurable portion
    of the CPU address space to the PCIe bus, and they must be a power of
    two in size and naturally aligned.  The rest is different:

    - The M32 window:

      * Is limited to 4GB in size.

      * Drops the top bits of the address (above the size) and replaces
        them with a configurable value.  This is typically used to
        generate 32-bit PCIe accesses.  We configure that window at boot
        from FW and don't touch it from Linux; it's usually set to forward
        a 2GB portion of address space from the CPU to PCIe
        0x8000_0000..0xffff_ffff.  (Note: the top 64KB are actually
        reserved for MSIs, but this is not a problem at this point; we
        just need to ensure Linux doesn't assign anything there.  The M32
        logic ignores that, however, and will forward in that space if we
        try.)

      * It is divided into 256 segments of equal size.  A table in the
        chip maps each segment to a PE#.  That allows portions of the MMIO
        space to be assigned to PEs at a segment granularity.  For a 2GB
        window, the segment granularity is 2GB/256 = 8MB.

    This is the "main" window we use in Linux today (excluding SR-IOV).
    We basically use the trick of forcing the bridge MMIO windows onto a
    segment alignment/granularity so that the space behind a bridge can be
    assigned to a PE.

    Ideally we would like to be able to have individual functions in PEs,
    but that would mean using a completely different address allocation
    scheme where individual function BARs can be "grouped" to fit in one
    or more segments.

    - The M64 windows:

      * Must be at least 256MB in size.

      * Do not translate addresses (the address on PCIe is the same as the
        address on the PowerBus).  There is a way to also set the top 14
        bits, which are not conveyed by PowerBus, but we don't use this.

      * Can be configured to be segmented.  When not segmented, we can
        specify the PE# for the entire window.  When segmented, a window
        has 256 segments; however, there is no table for mapping a segment
        to a PE#.  The segment number *is* the PE#.

      * Support overlaps.  If an address is covered by multiple windows,
        there's a defined ordering for which window applies.

    We have code (fairly new compared to the M32 stuff) that exploits that
    for large BARs in 64-bit space:

    We configure an M64 window to cover the entire region of address space
    that has been assigned by FW for the PHB (about 64GB; ignore the space
    for the M32, it comes out of a different "reserve").  We configure it
    as segmented.

    Then we do the same thing as with M32, using the bridge alignment
    trick, to match to those giant segments.

    Since we cannot remap, we have two additional constraints:

    - We do the PE# allocation *after* the 64-bit space has been assigned,
      because the addresses we use directly determine the PE#.  We then
      update the M32 PE# for the devices that use both 32-bit and 64-bit
      spaces, or assign the remaining PE#s to 32-bit-only devices.

    - We cannot "group" segments in HW, so if a device ends up using more
      than one segment, we end up with more than one PE#.  There is a HW
      mechanism to make the freeze state cascade to "companion" PEs, but
      that only works for PCIe error messages (typically used so that if
      you freeze a switch, it freezes all its children).  So we do it in
      SW.  We lose a bit of EEH effectiveness in that case, but that's the
      best we have found.  So when any of the PEs freezes, we freeze the
      other ones for that "domain".  We thus introduce the concept of a
      "master PE", which is the one used for DMA, MSIs, etc., and
      "secondary PEs" that are used for the remaining M64 segments.
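    To make the mapping above concrete, here is a small user-space C
    sketch (illustrative only; the structure and values are hypothetical,
    not the kernel's actual data structures).  It contrasts an M32-style
    window, which looks up the PE# in a per-segment table, with a
    segmented M64 window, where the segment number is the PE#, and shows
    how a BAR spanning two M64 segments lands in two PEs (a "master" plus
    a "secondary"):

      #include <stdint.h>
      #include <stdio.h>

      #define SEGMENTS 256

      /* Hypothetical model of one PHB window; not kernel code. */
      struct window {
              uint64_t base;                 /* CPU address of the window  */
              uint64_t size;                 /* power of two, 256 segments */
              int      segmented_m64;        /* 1: segment number == PE#   */
              uint8_t  seg_to_pe[SEGMENTS];  /* only used for M32-style    */
      };

      static int addr_to_pe(const struct window *w, uint64_t addr)
      {
              uint64_t seg = (addr - w->base) / (w->size / SEGMENTS);

              return w->segmented_m64 ? (int)seg : w->seg_to_pe[seg];
      }

      int main(void)
      {
              /* Hypothetical segmented M64 window: 64GB -> 256MB segments. */
              struct window m64 = { .base = 0x3fc000000000ULL,
                                    .size = 64ULL << 30,
                                    .segmented_m64 = 1 };

              /* A 512MB BAR placed on segment 5 also covers segment 6, so
               * it ends up with two PE#s (master PE 5, secondary PE 6).   */
              uint64_t bar_start = m64.base + 5 * (m64.size / SEGMENTS);
              uint64_t bar_end   = bar_start + (512ULL << 20) - 1;

              printf("BAR start -> PE#%d, BAR end -> PE#%d\n",
                     addr_to_pe(&m64, bar_start), addr_to_pe(&m64, bar_end));
              return 0;
      }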
    We would like to investigate using additional M64 windows in "single
    PE" mode to overlay specific BARs to work around some of that, for
    example for devices with very large BARs (e.g., GPUs).  It would make
    sense, but we haven't done it yet.

3. Considerations for SR-IOV on PowerKVM

  * SR-IOV Background

    The PCIe SR-IOV feature allows a single Physical Function (PF) to
    support several Virtual Functions (VFs).  Registers in the PF's SR-IOV
    Capability control the number of VFs and whether they are enabled.

    When VFs are enabled, they appear in Configuration Space like normal
    PCI devices, but the BARs in VF config space headers are unusual.  For
    a non-VF device, software uses BARs in the config space header to
    discover the BAR sizes and assign addresses for them.  For VF devices,
    software uses VF BAR registers in the *PF* SR-IOV Capability to
    discover sizes and assign addresses.  The BARs in the VF's config
    space header are read-only zeros.

    When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
    base address for all the corresponding VF(n) BARs.  For example, if
    the PF SR-IOV Capability is programmed to enable eight VFs, and it has
    a 1MB VF BAR0, the address in that VF BAR sets the base of an 8MB
    region.  This region is divided into eight contiguous 1MB regions,
    each of which is a BAR0 for one of the VFs.  Note that even though the
    VF BAR describes an 8MB region, the alignment requirement is for a
    single VF, i.e., 1MB in this example.

  There are several strategies for isolating VFs in PEs:

  - M32 window: There's one M32 window, and it is split into 256
    equally-sized segments.  The finest granularity possible is a 256MB
    window with 1MB segments.  VF BARs that are 1MB or larger could be
    mapped to separate PEs in this window.  Each segment can be
    individually mapped to a PE via the lookup table, so this is quite
    flexible, but it works best when all the VF BARs are the same size.
    If they are different sizes, the entire window has to be small enough
    that the segment size matches the smallest VF BAR, which means larger
    VF BARs span several segments.

  - Non-segmented M64 window: A non-segmented M64 window is mapped
    entirely to a single PE, so it could only isolate one VF.

  - Single segmented M64 window: A segmented M64 window could be used just
    like the M32 window, but the segments can't be individually mapped to
    PEs (the segment number is the PE#), so there isn't as much
    flexibility.  A VF with multiple BARs would have to be in a "domain"
    of multiple PEs, which is not as well isolated as a single PE.

  - Multiple segmented M64 windows: As usual, each window is split into
    256 equally-sized segments, and the segment number is the PE#.  But if
    we use several M64 windows, they can be set to different base
    addresses and different segment sizes.  If we have VFs that each have
    a 1MB BAR and a 32MB BAR, we could use one M64 window to assign 1MB
    segments and another M64 window to assign 32MB segments.
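  As a worked example of the last strategy (the device and numbers below
  are purely illustrative): if every VF has a 1MB BAR0 and a 32MB BAR2,
  one M64 window can be sized so that its 256 segments are 1MB each and a
  second window so that its segments are 32MB each.  The sketch below only
  does that arithmetic:

    #include <stdio.h>

    #define SEGMENTS 256ULL
    #define MB (1ULL << 20)

    int main(void)
    {
            /* Hypothetical SR-IOV device: each VF has a 1MB BAR0 and a
             * 32MB BAR2; one segmented M64 window per VF BAR size.       */
            unsigned long long vf_bar_sizes[] = { 1 * MB, 32 * MB };
            int num_vfs = 16;

            for (int i = 0; i < 2; i++) {
                    unsigned long long seg  = vf_bar_sizes[i]; /* 1 VF/segment */
                    unsigned long long win  = seg * SEGMENTS;  /* window size  */
                    unsigned long long used = seg * num_vfs;   /* backed by VFs */

                    printf("VF BAR of %lluMB: window %lluMB, %lluMB used by %d VFs\n",
                           seg / MB, win / MB, used / MB, num_vfs);
            }
            return 0;
    }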
  Finally, there is the plan to use M64 windows for SR-IOV, which is
  described in more detail in the next two sections.  For a given VF BAR,
  we need to effectively reserve the entire 256 segments (256 * VF BAR
  size) and position the VF BAR to start at the beginning of a free range
  of segments/PEs inside that M64 window.

  The goal is, of course, to be able to give a separate PE to each VF.

  The IODA2 platform has 16 M64 windows, which are used to map MMIO ranges
  to PE#s.  Each M64 window defines one MMIO range, and this range is
  divided into 256 segments, with each segment corresponding to one PE.

  We decided to leverage these M64 windows to map VFs to individual PEs,
  since SR-IOV VF BARs are all the same size.

  But doing so introduces another problem: total_VFs is usually smaller
  than the number of M64 window segments, so if we map one VF BAR directly
  to one M64 window, some part of the M64 window will map to another
  device's MMIO range.

  IODA supports 256 PEs, so segmented windows contain 256 segments, so if
  total_VFs is less than 256, we have the situation in Figure 1.0, where
  segments [total_VFs, 255] of the M64 window may map to some MMIO range
  on other devices:

     0      1                     total_VFs - 1
     +------+------+-     -+------+------+
     |      |      |  ...  |      |      |
     +------+------+-     -+------+------+

                           VF(n) BAR space

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.0 Direct map VF(n) BAR space

  Our current solution is to allocate 256 segments even if the VF(n) BAR
  space doesn't need that much, as shown in Figure 1.1:

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           VF(n) BAR space + extra

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.1 Map VF(n) BAR space + extra

  Allocating the extra space ensures that the entire M64 window will be
  assigned to this one SR-IOV device and that none of the space will be
  available for other devices.  Note that this only expands the space
  reserved in software; there are still only total_VFs VFs, and they only
  respond to segments [0, total_VFs - 1].  There's nothing in hardware
  that responds to segments [total_VFs, 255].
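  To put numbers on Figure 1.1 (hypothetical values, chosen only for the
  example): with a 1MB VF BAR and total_VFs = 64, the VF(n) BAR space
  itself needs only 64MB, but the full 256MB M64 window is reserved in
  software so that segments [64, 255] can never be handed to another
  device.  A minimal sketch of that arithmetic:

    #include <stdio.h>

    #define SEGMENTS 256ULL
    #define MB (1ULL << 20)

    int main(void)
    {
            /* Hypothetical device: 1MB VF BAR, total_VFs = 64. */
            unsigned long long vf_bar    = 1 * MB;
            unsigned long long total_vfs = 64;

            unsigned long long needed   = vf_bar * total_vfs; /* VF(n) BAR space */
            unsigned long long reserved = vf_bar * SEGMENTS;  /* whole M64 window */

            printf("VF(n) BAR space:  %4lluMB (segments 0..%llu)\n",
                   needed / MB, total_vfs - 1);
            printf("Reserved in SW:   %4lluMB (segments 0..%llu)\n",
                   reserved / MB, SEGMENTS - 1);
            printf("Extra, unused HW: %4lluMB (segments %llu..%llu)\n",
                   (reserved - needed) / MB, total_vfs, SEGMENTS - 1);
            return 0;
    }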
4. Implications for the Generic PCI Code

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#.  If the address is in an
M32 window, we can set the PE# by updating the table that translates
segments to PE#s.  Similarly, if the address is in an unsegmented M64
window, we can set the PE# for the window.  But if it's in a segmented M64
window, the segment number is the PE#.

Therefore, the only way to control the PE# for a VF is to change the base
of the VF(n) BAR space in the VF BAR.  If the PCI core allocates the exact
amount of space required for the VF(n) BAR space, the VF BAR value is
fixed and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF BAR
value can be changed as long as the entire VF(n) BAR space remains inside
the space allocated by the core.

Ideally, the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE.  The VF BARs (and therefore the PE#s)
are contiguous.  If VF0 is in PE(x), then VF(n) is in PE(x+n).  If we
allocate 256 segments, there are (256 - numVFs) choices for the PE# of
VF0.

If the segment size is smaller than the VF BAR size, it will take several
segments to cover a VF BAR, and a VF will be in several PEs.  This is
possible, but the isolation isn't as good, and it reduces the number of
PE# choices because, instead of consuming only numVFs segments, the VF(n)
BAR space will consume (numVFs * n) segments.  That means there aren't as
many available segments for adjusting the base of the VF(n) BAR space.
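The following user-space sketch (hypothetical numbers; this is not the
kernel's allocator) illustrates the last two paragraphs: when each VF BAR
occupies one segment, VF0 in PE(x) puts VF(n) in PE(x+n), and moving the
base of the VF(n) BAR space by one segment shifts every PE# by one; when
each VF BAR spans n segments, the space consumes numVFs * n segments
instead:

    #include <stdio.h>

    #define SEGMENTS 256

    /* Hypothetical layout: the VF(n) BAR space starts at segment 'offset'
     * and each VF BAR covers 'segs_per_vf' segments, so VF i starts in
     * PE (offset + i * segs_per_vf).                                      */
    static int first_pe_of_vf(int offset, int segs_per_vf, int vf)
    {
            return offset + vf * segs_per_vf;
    }

    int main(void)
    {
            int num_vfs = 16;
            int offset  = 8;   /* hypothetical free segment used as the base */

            for (int segs_per_vf = 1; segs_per_vf <= 2; segs_per_vf++) {
                    int consumed = num_vfs * segs_per_vf;  /* numVFs * n */

                    printf("segs/VF=%d: VF0 -> PE#%d, VF%d -> PE#%d, %d of %d segments consumed\n",
                           segs_per_vf,
                           first_pe_of_vf(offset, segs_per_vf, 0),
                           num_vfs - 1,
                           first_pe_of_vf(offset, segs_per_vf, num_vfs - 1),
                           consumed, SEGMENTS);
            }
            return 0;
    }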