path: root/drivers/vfio
Age | Commit message | Author | Files | Lines
2014-10-15 | Merge tag 'iommu-updates-v3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu | Linus Torvalds | 1 | -2/+2
Pull IOMMU updates from Joerg Roedel: "This pull-request includes: - change in the IOMMU-API to convert the former iommu_domain_capable function to just iommu_capable - various fixes in handling RMRR ranges for the VT-d driver (one fix requires a device driver core change which was acked by Greg KH) - the AMD IOMMU driver now assigns and deassigns complete alias groups to fix issues with devices using the wrong PCI request-id - MMU-401 support for the ARM SMMU driver - multi-master IOMMU group support for the ARM SMMU driver - various other small fixes all over the place" * tag 'iommu-updates-v3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (41 commits) iommu/vt-d: Work around broken RMRR firmware entries iommu/vt-d: Store bus information in RMRR PCI device path iommu/vt-d: Only remove domain when device is removed driver core: Add BUS_NOTIFY_REMOVED_DEVICE event iommu/amd: Fix devid mapping for ivrs_ioapic override iommu/irq_remapping: Fix the regression of hpet irq remapping iommu: Fix bus notifier breakage iommu/amd: Split init_iommu_group() from iommu_init_device() iommu: Rework iommu_group_get_for_pci_dev() iommu: Make of_device_id array const amd_iommu: do not dereference a NULL pointer address. iommu/omap: Remove omap_iommu unused owner field iommu: Remove iommu_domain_has_cap() API function IB/usnic: Convert to use new iommu_capable() API function vfio: Convert to use new iommu_capable() API function kvm: iommu: Convert to use new iommu_capable() API function iommu/tegra: Convert to iommu_capable() API function iommu/msm: Convert to iommu_capable() API function iommu/vt-d: Convert to iommu_capable() API function iommu/fsl: Convert to iommu_capable() API function ...
2014-10-11 | Merge tag 'vfio-v3.18-rc1' of git://github.com/awilliam/linux-vfio | Linus Torvalds | 4 | -85/+98
Pull VFIO updates from Alex Williamson: - Nested IOMMU extension to type1 (Will Deacon) - Restore MSIx message before enabling (Gavin Shan) - Fix remove path locking (Alex Williamson) * tag 'vfio-v3.18-rc1' of git://github.com/awilliam/linux-vfio: vfio-pci: Fix remove path locking drivers/vfio: Export vfio_spapr_iommu_eeh_ioctl() with GPL vfio/pci: Restore MSIx message prior to enabling PCI: Export MSI message relevant functions vfio/iommu_type1: add new VFIO_TYPE1_NESTING_IOMMU IOMMU type iommu: introduce domain attribute for nesting IOMMUs
2014-09-29 | vfio-pci: Fix remove path locking | Alex Williamson | 1 | -79/+57
Locking both the remove() and release() path results in a deadlock that should have been obvious. To fix this we can get and hold the vfio_device reference as we evaluate whether to do a bus/slot reset. This will automatically block any remove() calls, allowing us to remove the explicit lock. Fixes 61d792562b53. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Cc: stable@vger.kernel.org [3.17]
2014-09-29 | drivers/vfio: Export vfio_spapr_iommu_eeh_ioctl() with GPL | Gavin Shan | 1 | -1/+1
The function should have been exported with EXPORT_SYMBOL_GPL() as part of commit 92d18a6851fb ("drivers/vfio: Fix EEH build error"). Suggested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-09-29 | vfio/pci: Restore MSIx message prior to enabling | Gavin Shan | 1 | -0/+15
The MSIx vector table lives in device memory, which may be cleared as part of a backdoor device reset. This is the case on the IBM IPR HBA when the BIST is run on the device. When assigned to a QEMU guest, the guest driver does a pci_save_state(), issues a BIST, then does a pci_restore_state(). The BIST clears the MSIx vector table, but due to the way interrupts are configured the pci_restore_state() does not restore the vector table as expected. Eventually this results in an EEH error on Power platforms when the device attempts to signal an interrupt with the zero'd table entry. Fix the problem by restoring the host cached MSI message prior to enabling each vector. Reported-by: Wen Xiong <wenxiong@linux.vnet.ibm.com> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-09-29 | vfio/iommu_type1: add new VFIO_TYPE1_NESTING_IOMMU IOMMU type | Will Deacon | 1 | -5/+25
VFIO allows devices to be safely handed off to userspace by putting them behind an IOMMU configured to ensure DMA and interrupt isolation. This enables userspace KVM clients, such as kvmtool and qemu, to further map the device into a virtual machine. With IOMMUs such as the ARM SMMU, it is then possible to provide SMMU translation services to the guest operating system, which are nested with the existing translation installed by VFIO. However, enabling this feature means that the IOMMU driver must be informed that the VFIO domain is being created for the purposes of nested translation. This patch adds a new IOMMU type (VFIO_TYPE1_NESTING_IOMMU) to the VFIO type-1 driver. The new IOMMU type acts identically to the VFIO_TYPE1v2_IOMMU type, but additionally sets the DOMAIN_ATTR_NESTING attribute on its IOMMU domains. Cc: Joerg Roedel <joro@8bytes.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-09-25 | PCI/AER: Rename PCI_ERR_UNC_TRAIN to PCI_ERR_UNC_UND | Chen, Gong | 1 | -1/+1
In PCIe r1.0, sec 5.10.2, bit 0 of the Uncorrectable Error Status, Mask, and Severity Registers was for "Training Error." In PCIe r1.1, sec 7.10.2, bit 0 was redefined to be "Undefined." Rename PCI_ERR_UNC_TRAIN to PCI_ERR_UNC_UND to reflect this change. No functional change. [bhelgaas: changelog] Signed-off-by: Chen, Gong <gong.chen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2014-09-25 | vfio: Convert to use new iommu_capable() API function | Joerg Roedel | 1 | -2/+2
Cc: Alex Williamson <alex.williamson@redhat.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
2014-08-08 | drivers/vfio: Enable VFIO if EEH is not supported | Alexey Kardashevskiy | 2 | -7/+3
The existing vfio_pci_open() fails upon error returned from vfio_spapr_pci_eeh_open(), which breaks POWER7's P5IOC2 PHB support which this patch brings back. The patch fixes the issue by dropping the return value of vfio_spapr_pci_eeh_open(). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-08-08 | drivers/vfio: Allow EEH to be built as module | Alexey Kardashevskiy | 1 | -0/+10
This adds the necessary declarations to the SPAPR VFIO EEH module; otherwise multiple dynamic linker errors are reported: vfio_spapr_eeh: Unknown symbol eeh_pe_set_option (err 0) vfio_spapr_eeh: Unknown symbol eeh_pe_configure (err 0) vfio_spapr_eeh: Unknown symbol eeh_pe_reset (err 0) vfio_spapr_eeh: Unknown symbol eeh_pe_get_state (err 0) vfio_spapr_eeh: Unknown symbol eeh_iommu_group_to_pe (err 0) vfio_spapr_eeh: Unknown symbol eeh_dev_open (err 0) vfio_spapr_eeh: Unknown symbol eeh_pe_set_option (err 0) vfio_spapr_eeh: Unknown symbol eeh_pe_configure (err 0) vfio_spapr_eeh: Unknown symbol eeh_pe_reset (err 0) vfio_spapr_eeh: Unknown symbol eeh_pe_get_state (err 0) vfio_spapr_eeh: Unknown symbol eeh_iommu_group_to_pe (err 0) vfio_spapr_eeh: Unknown symbol eeh_dev_open (err 0) Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-08-08 | drivers/vfio: Fix EEH build error | Gavin Shan | 3 | -1/+10
The VFIO related components could be built as dynamic modules. Unfortunately, CONFIG_EEH can't be configured to "m". The patch fixes the build errors when configuring VFIO related components as dynamic modules as follows: CC [M] drivers/vfio/vfio_iommu_spapr_tce.o In file included from drivers/vfio/vfio.c:33:0: include/linux/vfio.h:101:43: warning: ‘struct pci_dev’ declared \ inside parameter list [enabled by default] : WRAP arch/powerpc/boot/zImage.pseries WRAP arch/powerpc/boot/zImage.maple WRAP arch/powerpc/boot/zImage.pmac WRAP arch/powerpc/boot/zImage.epapr MODPOST 1818 modules ERROR: ".vfio_spapr_iommu_eeh_ioctl" [drivers/vfio/vfio_iommu_spapr_tce.ko]\ undefined! ERROR: ".vfio_spapr_pci_eeh_open" [drivers/vfio/pci/vfio-pci.ko] undefined! ERROR: ".vfio_spapr_pci_eeh_release" [drivers/vfio/pci/vfio-pci.ko] undefined! Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-08-07 | vfio-pci: Attempt bus/slot reset on release | Alex Williamson | 2 | -0/+113
Each time a device is released, mark whether a local reset was successful or whether a bus/slot reset is needed. If a reset is needed and all of the affected devices are bound to vfio-pci and unused, allow the reset. This is most useful when the userspace driver is killed and releases all the devices in an unclean state, such as when a QEMU VM quits. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-08-07 | vfio-pci: Use mutex around open, release, and remove | Alex Williamson | 2 | -12/+23
Serializing open/release allows us to fix a refcnt error if we fail to enable the device and lets us prevent devices from being unbound or opened, giving us an opportunity to do bus resets on release. No restriction added to serialize binding devices to vfio-pci while the mutex is held though. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-08-07 | vfio-pci: Release devices with BusMaster disabled | Alex Williamson | 1 | -2/+8
Our current open/release path looks like this: vfio_pci_open vfio_pci_enable pci_enable_device pci_save_state pci_store_saved_state vfio_pci_release vfio_pci_disable pci_disable_device pci_restore_state pci_enable_device() doesn't modify PCI_COMMAND_MASTER, so if a device comes to us with it enabled, it persists through the open and gets stored as part of the device saved state. We then restore that saved state when released, which can allow the device to attempt to continue to do DMA. When the group is disconnected from the domain, this will get caught by the IOMMU, but if there are other devices in the group, the device may continue running and interfere with the user. Even in the former case, IOMMUs don't necessarily behave well and a stream of blocked DMA can result in unpleasant behavior on the host. Explicitly disable Bus Master as we're enabling the device and slightly re-work release to make sure that pci_disable_device() is the last thing that touches the device. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
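The ordering problem described in this commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct fake_dev`, `model_open()` and `model_release()` are hypothetical names that do not exist in the kernel, and the only value borrowed from it is the Bus Master bit of the PCI command register (0x4, the value of PCI_COMMAND_MASTER).

```c
#include <assert.h>
#include <stdint.h>

#define CMD_MASTER 0x0004u /* Bus Master enable bit of the PCI command register */

/* Hypothetical model of a device: the live command register plus the
 * copy that is saved at open time and restored at release time. */
struct fake_dev {
    uint16_t command;
    uint16_t saved_command;
};

/* Open: clear Bus Master first, so the saved state never contains it. */
static void model_open(struct fake_dev *dev)
{
    dev->command &= (uint16_t)~CMD_MASTER;
    dev->saved_command = dev->command;
}

/* Release: restore the state that was saved at open time. */
static void model_release(struct fake_dev *dev)
{
    dev->command = dev->saved_command;
    /* The property the commit aims for: the device leaves with DMA off. */
    assert((dev->command & CMD_MASTER) == 0);
}
```

In this model, a user that sets Bus Master while the device is open cannot leak it past release, because the restored state was captured after the bit was cleared.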
2014-08-05 | drivers/vfio: EEH support for VFIO PCI device | Gavin Shan | 4 | -5/+118
The patch adds new IOCTL commands for sPAPR VFIO container device to support EEH functionality for PCI devices, which have been passed through from host to somebody else via VFIO. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Alexander Graf <agraf@suse.de> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-06-07 | Merge tag 'vfio-v3.16-rc1' of git://github.com/awilliam/linux-vfio into next | Linus Torvalds | 3 | -34/+24
Pull VFIO updates from Alex Williamson: "A handful of VFIO bug fixes for v3.16" * tag 'vfio-v3.16-rc1' of git://github.com/awilliam/linux-vfio: drivers/vfio/pci: Fix wrong MSI interrupt count drivers/vfio: Rework offsetofend() vfio/iommu_type1: Avoid overflow vfio/pci: Fix unchecked return value vfio/pci: Fix sizing of DPA and THP express capabilities
2014-05-30 | drivers/vfio/pci: Fix wrong MSI interrupt count | Gavin Shan | 1 | -2/+1
According to the PCI Local Bus specification, the Message Control register for MSI (offset: 2, length: 2) uses bit#0 to enable or disable MSI logic, and that bit shouldn't contribute to the calculation of the MSI interrupt count. The patch fixes the issue. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
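As a rough illustration of the calculation, the sketch below decodes the vector count from a Message Control value. The helper name `msi_vector_count()` is hypothetical, and the code is not taken from the patch. The masks mirror the kernel's PCI_MSI_FLAGS_ENABLE (bit 0) and PCI_MSI_FLAGS_QMASK (bits 3:1, Multiple Message Capable).

```c
#include <assert.h>
#include <stdint.h>

#define MSI_FLAGS_ENABLE 0x0001u /* bit 0: MSI enable */
#define MSI_FLAGS_QMASK  0x000eu /* bits 3:1: Multiple Message Capable */

/* Hypothetical helper: number of vectors the device is capable of.
 * The field holds log2 of the count and starts at bit 1, so it has to
 * be shifted down past the enable bit before it is used as an exponent. */
static unsigned int msi_vector_count(uint16_t msg_ctrl)
{
    unsigned int log2_count = (msg_ctrl & MSI_FLAGS_QMASK) >> 1;

    assert(log2_count <= 5); /* encodings 6 and 7 are reserved */
    return 1u << log2_count;
}
```

Without the shift, the masked field would be used as the exponent while still sitting one bit position too high, which overstates the count for every non-zero encoding.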
2014-05-30 | vfio/iommu_type1: Avoid overflow | Alex Williamson | 1 | -27/+18
Coverity reports use of a tainted scalar used as a loop boundary. For the most part, any values passed from userspace for a DMA mapping size, IOVA, or virtual address are valid, with some alignment constraints. The size is ultimately bound by how many pages the user is able to lock, IOVA is tested by the IOMMU driver when doing a map, and the virtual address needs to pass get_user_pages. The only problem I can find is that we do expect the __u64 user values to fit within our variables, which might not happen on 32-bit platforms. Add a test for this and return error on overflow. Also propagate use of the type-correct local variables throughout the function. The above also points to the 'end' variable, which can be zero if we're operating at the very top of the address space. We try to account for this, but our loop botches it. Rework the loop to use the remaining size as our loop condition rather than the IOVA vs end. Detected by Coverity: CID 714659 Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
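Both points in this message can be sketched in user-space C. The names `narrow_t`, `to_narrow()` and `count_pages()` are hypothetical, and `narrow_t` is deliberately fixed at 32 bits to imitate a 32-bit platform. The real function works on kernel types and is not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Stand-in for a 32-bit platform's native word (an assumption made so
 * the overflow case can be exercised on any host). */
typedef uint32_t narrow_t;

/* Reject 64-bit user values that do not survive the narrowing cast. */
static int to_narrow(uint64_t user_val, narrow_t *out)
{
    narrow_t n = (narrow_t)user_val;

    if (n != user_val)
        return -EOVERFLOW;
    *out = n;
    return 0;
}

/* Walk a range using the remaining size as the loop condition. A loop
 * written as "while (iova < end)" would run zero times when the range
 * ends at the top of the address space, because end wraps to 0. */
static unsigned int count_pages(narrow_t iova, narrow_t size, narrow_t page_size)
{
    unsigned int pages = 0;

    assert(page_size != 0 && size % page_size == 0);
    while (size) {
        pages++;
        iova += page_size; /* may wrap to 0 on the last step; harmless */
        size -= page_size;
    }
    return pages;
}
```

For a two-page range that ends exactly at the top of the 32-bit space, `iova + size` is 0, yet the size-driven loop still visits both pages.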
2014-05-30 | vfio/pci: Fix unchecked return value | Alex Williamson | 1 | -1/+2
There's nothing we can do differently if pci_load_and_free_saved_state() fails, other than maybe print some log message, but the actual re-load of the state is an unnecessary step here since we've only just saved it. We can clean up a Coverity warning and eliminate the unnecessary step by freeing the state ourselves. Detected by Coverity: CID 753101 Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-05-30 | vfio/pci: Fix sizing of DPA and THP express capabilities | Alex Williamson | 1 | -4/+3
When sizing the TPH capability we store the register containing the table size into the 'dword' variable, but then use the uninitialized 'byte' variable to analyze the size. The table size is also actually reported as an N-1 value, so correct sizing to account for this. The round_up() for both TPH and DPA is unnecessary, remove it. Detected by Coverity: CID 714665 & 715156 Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-05-27 | driver core: dev_set_drvdata can no longer fail | Jean Delvare | 1 | -7/+1
So there is no point in checking its return value, which will soon disappear. Signed-off-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-03 | Merge tag 'vfio-v3.15-rc1' of git://github.com/awilliam/linux-vfio | Linus Torvalds | 3 | -301/+362
Pull VFIO updates from Alex Williamson: "VFIO updates for v3.15 include: - Allow the vfio-type1 IOMMU to support multiple domains within a container - Plumb path to query whether all domains are cache-coherent - Wire query into kvm-vfio device to avoid KVM x86 WBINVD emulation - Always select CONFIG_ANON_INODES, vfio depends on it (Arnd) The first patch also makes the vfio-type1 IOMMU driver completely independent of the bus_type of the devices it's handling, which enables it to be used for both vfio-pci and a future vfio-platform (and hopefully combinations involving both simultaneously)" * tag 'vfio-v3.15-rc1' of git://github.com/awilliam/linux-vfio: vfio: always select ANON_INODES kvm/vfio: Support for DMA coherent IOMMUs vfio: Add external user check extension interface vfio/type1: Add extension to test DMA cache coherence of IOMMU vfio/iommu_type1: Multi-IOMMU domain support
2014-04-01 | Merge tag 'pci-v3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci | Linus Torvalds | 1 | -4/+8
Pull PCI changes from Bjorn Helgaas: "Enumeration - Increment max correctly in pci_scan_bridge() (Andreas Noever) - Clarify the "scan anyway" comment in pci_scan_bridge() (Andreas Noever) - Assign CardBus bus number only during the second pass (Andreas Noever) - Use request_resource_conflict() instead of insert_ for bus numbers (Andreas Noever) - Make sure bus number resources stay within their parents bounds (Andreas Noever) - Remove pci_fixup_parent_subordinate_busnr() (Andreas Noever) - Check for child busses which use more bus numbers than allocated (Andreas Noever) - Don't scan random busses in pci_scan_bridge() (Andreas Noever) - x86: Drop pcibios_scan_root() check for bus already scanned (Bjorn Helgaas) - x86: Use pcibios_scan_root() instead of pci_scan_bus_with_sysdata() (Bjorn Helgaas) - x86: Use pcibios_scan_root() instead of pci_scan_bus_on_node() (Bjorn Helgaas) - x86: Merge pci_scan_bus_on_node() into pcibios_scan_root() (Bjorn Helgaas) - x86: Drop return value of pcibios_scan_root() (Bjorn Helgaas) NUMA - x86: Add x86_pci_root_bus_node() to look up NUMA node from PCI bus (Bjorn Helgaas) - x86: Use x86_pci_root_bus_node() instead of get_mp_bus_to_node() (Bjorn Helgaas) - x86: Remove mp_bus_to_node[], set_mp_bus_to_node(), get_mp_bus_to_node() (Bjorn Helgaas) - x86: Use NUMA_NO_NODE, not -1, for unknown node (Bjorn Helgaas) - x86: Remove acpi_get_pxm() usage (Bjorn Helgaas) - ia64: Use NUMA_NO_NODE, not MAX_NUMNODES, for unknown node (Bjorn Helgaas) - ia64: Remove acpi_get_pxm() usage (Bjorn Helgaas) - ACPI: Fix acpi_get_node() prototype (Bjorn Helgaas) Resource management - i2o: Fix and refactor PCI space allocation (Bjorn Helgaas) - Add resource_contains() (Bjorn Helgaas) - Add %pR support for IORESOURCE_UNSET (Bjorn Helgaas) - Mark resources as IORESOURCE_UNSET if we can't assign them (Bjorn Helgaas) - Don't clear IORESOURCE_UNSET when updating BAR (Bjorn Helgaas) - Check IORESOURCE_UNSET before updating BAR (Bjorn Helgaas) - Don't try to claim IORESOURCE_UNSET resources (Bjorn Helgaas) - Mark 64-bit resource as IORESOURCE_UNSET if we only support 32-bit (Bjorn Helgaas) - Don't enable decoding if BAR hasn't been assigned an address (Bjorn Helgaas) - Add "weak" generic pcibios_enable_device() implementation (Bjorn Helgaas) - alpha, microblaze, sh, sparc, tile: Use default pcibios_enable_device() (Bjorn Helgaas) - s390: Use generic pci_enable_resources() (Bjorn Helgaas) - Don't check resource_size() in pci_bus_alloc_resource() (Bjorn Helgaas) - Set type in __request_region() (Bjorn Helgaas) - Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region() (Bjorn Helgaas) - Change pci_bus_alloc_resource() type_mask to unsigned long (Bjorn Helgaas) - Log IDE resource quirk in dmesg (Bjorn Helgaas) - Revert "[PATCH] Insert GART region into resource map" (Bjorn Helgaas) PCI device hotplug - Make check_link_active() non-static (Rajat Jain) - Use link change notifications for hot-plug and removal (Rajat Jain) - Enable link state change notifications (Rajat Jain) - Don't disable the link permanently during removal (Rajat Jain) - Don't check adapter or latch status while disabling (Rajat Jain) - Disable link notification across slot reset (Rajat Jain) - Ensure very fast hotplug events are also processed (Rajat Jain) - Add hotplug_lock to serialize hotplug events (Rajat Jain) - Remove a non-existent card, regardless of "surprise" capability (Rajat Jain) - Don't turn slot off when hot-added device already exists (Yijing Wang) MSI - Keep pci_enable_msi() documentation (Alexander Gordeev) - ahci: Fix broken single MSI fallback (Alexander Gordeev) - ahci, vfio: Use pci_enable_msi_range() (Alexander Gordeev) - Check kmalloc() return value, fix leak of name (Greg Kroah-Hartman) - Fix leak of msi_attrs (Greg Kroah-Hartman) - Fix pci_msix_vec_count() htmldocs failure (Masanari Iida) Virtualization - Device-specific ACS support (Alex Williamson) Freescale i.MX6 - Wait for retraining (Marek Vasut) Marvell MVEBU - Use Device ID and revision from underlying endpoint (Andrew Lunn) - Fix incorrect size for PCI aperture resources (Jason Gunthorpe) - Call request_resource() on the apertures (Jason Gunthorpe) - Fix potential issue in range parsing (Jean-Jacques Hiblot) Renesas R-Car - Check platform_get_irq() return code (Ben Dooks) - Add error interrupt handling (Ben Dooks) - Fix bridge logic configuration accesses (Ben Dooks) - Register each instance independently (Magnus Damm) - Break out window size handling (Magnus Damm) - Make the Kconfig dependencies more generic (Magnus Damm) Synopsys DesignWare - Fix RC BAR to be single 64-bit non-prefetchable memory (Mohit Kumar) Miscellaneous - Remove unused SR-IOV VF Migration support (Bjorn Helgaas) - Enable INTx if BIOS left them disabled (Bjorn Helgaas) - Fix hex vs decimal typo in cpqhpc_probe() (Dan Carpenter) - Clean up par-arch object file list (Liviu Dudau) - Set IORESOURCE_ROM_SHADOW only for the default VGA device (Sander Eikelenboom) - ACPI, ARM, drm, powerpc, pcmcia, PCI: Use list_for_each_entry() for bus traversal (Yijing Wang) - Fix pci_bus_b() build failure (Paul Gortmaker)" * tag 'pci-v3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (108 commits) Revert "[PATCH] Insert GART region into resource map" PCI: Log IDE resource quirk in dmesg PCI: Change pci_bus_alloc_resource() type_mask to unsigned long PCI: Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region() resources: Set type in __request_region() PCI: Don't check resource_size() in pci_bus_alloc_resource() s390/PCI: Use generic pci_enable_resources() tile PCI RC: Use default pcibios_enable_device() sparc/PCI: Use default pcibios_enable_device() (Leon only) sh/PCI: Use default pcibios_enable_device() microblaze/PCI: Use default pcibios_enable_device() alpha/PCI: Use default pcibios_enable_device() PCI: Add "weak" generic pcibios_enable_device() implementation PCI: Don't enable decoding if BAR hasn't been assigned an address PCI: Enable INTx in pci_reenable_device() only when MSI/MSI-X not enabled PCI: Mark 64-bit resource as IORESOURCE_UNSET if we only support 32-bit PCI: Don't try to claim IORESOURCE_UNSET resources PCI: Check IORESOURCE_UNSET before updating BAR PCI: Don't clear IORESOURCE_UNSET when updating BAR PCI: Mark resources as IORESOURCE_UNSET if we can't assign them ... Conflicts: arch/x86/include/asm/topology.h drivers/ata/ahci.c
2014-03-27 | vfio: always select ANON_INODES | Arnd Bergmann | 1 | -0/+1
The vfio code cannot be built when CONFIG_ANON_INODES is disabled, so this enforces the symbol to be enabled through Kconfig. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-03-04 | mm: close PageTail race | David Rientjes | 1 | -2/+2
Commit bf6bddf1924e ("mm: introduce compaction and migration for ballooned pages") introduces page_count(page) into memory compaction which dereferences page->first_page if PageTail(page). This results in a very rare NULL pointer dereference on the aforementioned page_count(page). Indeed, anything that does compound_head(), including page_count() is susceptible to racing with prep_compound_page() and seeing a NULL or dangling page->first_page pointer. This patch uses Andrea's implementation of compound_trans_head() that deals with such a race and makes it the default compound_head() implementation. This includes a read memory barrier that ensures that if PageTail(head) is true that we return a head page that is neither NULL nor dangling. The patch then adds a store memory barrier to prep_compound_page() to ensure page->first_page is set. This is the safest way to ensure we see the head page that we are expecting, PageTail(page) is already in the unlikely() path and the memory barriers are unfortunately required. Hugetlbfs is the exception, we don't enforce a store memory barrier during init since no race is possible. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Holger Kiehl <Holger.Kiehl@dwd.de> Cc: Christoph Lameter <cl@linux.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-26 | vfio: Add external user check extension interface | Alex Williamson | 1 | -0/+6
This lets us check extensions, particularly VFIO_DMA_CC_IOMMU using the external user interface, allowing KVM to probe IOMMU coherency. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-02-26 | vfio/type1: Add extension to test DMA cache coherence of IOMMU | Alex Williamson | 1 | -0/+21
Now that the type1 IOMMU backend can support IOMMU_CACHE, we need to be able to test whether coherency is currently enforced. Add an extension for this. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-02-26 | vfio/iommu_type1: Multi-IOMMU domain support | Alex Williamson | 1 | -302/+335
We currently have a problem that we cannot support advanced features of an IOMMU domain (e.g. IOMMU_CACHE), because we have no guarantee that those features will be supported by all of the hardware units involved with the domain over its lifetime. For instance, the Intel VT-d architecture does not require that all DRHDs support snoop control. If we create a domain based on a device behind a DRHD that does support snoop control and enable SNP support via the IOMMU_CACHE mapping option, we cannot then add a device behind a DRHD which does not support snoop control or we'll get reserved bit faults from the SNP bit in the pagetables. To add to the complexity, we can't know the properties of a domain until a device is attached. We could pass this problem off to userspace and require that a separate vfio container be used, but we don't know how to handle page accounting in that case. How do we know that a page pinned in one container is the same page as in a different container, and avoid double billing the user for the page? The solution is therefore to support multiple IOMMU domains per container. In the majority of cases, only one domain will be required since hardware is typically consistent within a system. However, this provides us the ability to validate compatibility of domains and support mixed environments where page table flags can be different between domains. To do this, our DMA tracking needs to change. We currently try to coalesce user mappings into as few tracking entries as possible. The problem then becomes that we lose granularity of user mappings. We've never guaranteed that a user is able to unmap at a finer granularity than the original mapping, but we must honor the granularity of the original mapping. This coalescing code is therefore removed, allowing only unmaps covering complete maps. The change in accounting is fairly small here; a typical QEMU VM will start out with roughly a dozen entries, so it's arguable whether this coalescing was ever needed.
We also move IOMMU domain creation to the point where a group is attached to the container. An interesting side-effect of this is that we now have access to the device at the time of domain creation and can probe the devices within the group to determine the bus_type. This finally makes vfio_iommu_type1 completely device/bus agnostic. In fact, each IOMMU domain can host devices on different buses managed by different physical IOMMUs, and present a single DMA mapping interface to the user. When a new domain is created, mappings are replayed to bring the IOMMU pagetables up to the state of the current container. And of course, DMA mapping and unmapping automatically traverse all of the configured IOMMU domains. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Cc: Varun Sethi <Varun.Sethi@freescale.com>
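The replay behaviour described above can be modelled in a few lines of user-space C. This is a minimal sketch, not the driver's data structures: `struct container`, `struct domain`, `container_map()` and `container_add_domain()` are hypothetical names, the fixed-size arrays stand in for the driver's real bookkeeping, and page pinning, accounting, and domain compatibility checks are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MAPS    8
#define MAX_DOMAINS 4

struct mapping { uint64_t iova; uint64_t size; };

/* Hypothetical stand-in for one IOMMU domain's page tables. */
struct domain {
    struct mapping maps[MAX_MAPS];
    size_t nmaps;
};

/* Hypothetical container: the authoritative mapping list plus domains. */
struct container {
    struct mapping maps[MAX_MAPS];
    size_t nmaps;
    struct domain domains[MAX_DOMAINS];
    size_t ndomains;
};

static void domain_map(struct domain *d, struct mapping m)
{
    assert(d->nmaps < MAX_MAPS);
    d->maps[d->nmaps++] = m;
}

/* Map: record the entry once, then apply it to every configured domain. */
static int container_map(struct container *c, uint64_t iova, uint64_t size)
{
    struct mapping m = { iova, size };

    if (c->nmaps == MAX_MAPS)
        return -ENOSPC;
    c->maps[c->nmaps++] = m;
    for (size_t i = 0; i < c->ndomains; i++)
        domain_map(&c->domains[i], m);
    return 0;
}

/* New domain: replay existing mappings so it matches the container. */
static int container_add_domain(struct container *c)
{
    struct domain *d;

    if (c->ndomains == MAX_DOMAINS)
        return -ENOSPC;
    d = &c->domains[c->ndomains++];
    d->nmaps = 0;
    for (size_t i = 0; i < c->nmaps; i++)
        domain_map(d, c->maps[i]);
    return 0;
}
```

Keeping one uncoalesced list at the container level is what makes the replay straightforward in this model: a late-arriving domain receives exactly the entries the user created.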
2014-02-14 | vfio: Use pci_enable_msi_range() and pci_enable_msix_range() | Alexander Gordeev | 1 | -4/+8
pci_enable_msix() and pci_enable_msi_block() have been deprecated; use pci_enable_msix_range() and pci_enable_msi_range() instead. [bhelgaas: changelog] Signed-off-by: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Alex Williamson <alex.williamson@redhat.com>
2014-01-27 | Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc | Linus Torvalds | 1 | -14/+14
Pull powerpc updates from Ben Herrenschmidt: "So here's my next branch for powerpc. A bit late as I was on vacation last week. It's mostly the same stuff that was in next already, I just added two patches today which are the wiring up of lockref for powerpc, which for some reason fell through the cracks last time and is trivial. The highlights are, in addition to a bunch of bug fixes: - Reworked Machine Check handling on kernels running without a hypervisor (or acting as a hypervisor). Provides hooks to handle some errors in real mode such as TLB errors, handle SLB errors, etc... - Support for retrieving memory error information from the service processor on IBM servers running without a hypervisor and routing them to the memory poison infrastructure. - _PAGE_NUMA support on server processors - 32-bit BookE relocatable kernel support - FSL e6500 hardware tablewalk support - A bunch of new/revived board support - FSL e6500 deeper idle states and altivec powerdown support You'll notice a generic mm change here, it has been acked by the relevant authorities and is a pre-req for our _PAGE_NUMA support" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (121 commits) powerpc: Implement arch_spin_is_locked() using arch_spin_value_unlocked() powerpc: Add support for the optimised lockref implementation powerpc/powernv: Call OPAL sync before kexec'ing powerpc/eeh: Escalate error on non-existing PE powerpc/eeh: Handle multiple EEH errors powerpc: Fix transactional FP/VMX/VSX unavailable handlers powerpc: Don't corrupt transactional state when using FP/VMX in kernel powerpc: Reclaim two unused thread_info flag bits powerpc: Fix races with irq_work Move precessing of MCE queued event out from syscall exit path.
pseries/cpuidle: Remove redundant call to ppc64_runlatch_off() in cpu idle routines powerpc: Make add_system_ram_resources() __init powerpc: add SATA_MV to ppc64_defconfig powerpc/powernv: Increase candidate fw image size powerpc: Add debug checks to catch invalid cpu-to-node mappings powerpc: Fix the setup of CPU-to-Node mappings during CPU online powerpc/iommu: Don't detach device without IOMMU group powerpc/eeh: Hotplug improvement powerpc/eeh: Call opal_pci_reinit() on powernv for restoring config space powerpc/eeh: Add restore_config operation ...
2014-01-24 | Merge tag 'vfio-v3.14-rc1' of git://github.com/awilliam/linux-vfio | Linus Torvalds | 3 | -54/+37
Pull vfio update from Alex Williamson: - convert to misc driver to support module auto loading - remove unnecessary and dangerous use of device_lock * tag 'vfio-v3.14-rc1' of git://github.com/awilliam/linux-vfio: vfio-pci: Don't use device_lock around AER interrupt setup vfio: Convert control interface to misc driver misc: Reserve minor for VFIO
2014-01-15 | vfio-pci: Use pci "try" reset interface | Alex Williamson | 1 | -20/+9
PCI resets will attempt to take the device_lock for any device to be reset. This is a problem if that lock is already held, for instance in the device remove path. It's not sufficient to simply kill the user process or skip the reset if called after .remove as a race could result in the same deadlock. Instead, we handle all resets as "best effort" using the PCI "try" reset interfaces. This prevents the user from being able to induce a deadlock by triggering a reset. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
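The "best effort" idea can be sketched with a try-lock in user-space C. Everything here is a hypothetical model: `struct reset_dev` and `model_try_reset()` are invented names, a C11 `atomic_flag` stands in for the device lock, and the `-EAGAIN` return value is chosen for illustration rather than taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical device model: a flag standing in for the device lock and
 * a counter recording how many resets actually ran. */
struct reset_dev {
    atomic_flag lock;
    int resets;
};

/* Best-effort reset: if the lock is already held (for example by a
 * remove path), give up immediately instead of waiting for it. */
static int model_try_reset(struct reset_dev *dev)
{
    assert(dev);
    if (atomic_flag_test_and_set(&dev->lock))
        return -EAGAIN; /* lock busy: skip the reset, never block */
    dev->resets++;      /* the reset itself would happen here */
    atomic_flag_clear(&dev->lock);
    return 0;
}
```

Because the caller never sleeps on the lock, a user who triggers a reset while the lock is held gets an error back rather than a deadlock.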
2014-01-14 | vfio-pci: Don't use device_lock around AER interrupt setup | Alex Williamson | 2 | -17/+4
device_lock is much too prone to lockups. For instance if we have a pending .remove then device_lock is already held. If userspace attempts to modify AER signaling after that point, a deadlock occurs. eventfd setup/teardown is already protected in vfio with the igate mutex. AER is not a high performance interrupt, so we can also use the same mutex to protect signaling versus setup races. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-12-30powerpc/iommu: Update constant names to reflect their hardcoded page sizeAlistair Popple1-14/+14
The powerpc iommu uses a hardcoded page size of 4K. This patch changes the name of the IOMMU_PAGE_* macros to reflect the hardcoded values. A future patch will use the existing names to support dynamic page sizes. Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-12-19vfio: Convert control interface to misc driverAlex Williamson1-37/+33
This change allows us to support module auto loading using devname support in userspace tools. With this, /dev/vfio/vfio will always be present and opening it will cause the vfio module to load. This should avoid needing to configure the system to statically load vfio in order to get libvirt to correctly detect support for it. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-12-17PCI: Rename PCI_VC_PORT_REG1/2 to PCI_VC_PORT_CAP1/2Alex Williamson1-6/+6
These are a set of two capability registers. It's pretty much a given that they're registers, so reflect their purpose in the name. Suggested-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2013-10-11VFIO: vfio_iommu_type1: fix bug caused by break in nested loopAntonios Motakis1-19/+21
In vfio_iommu_type1.c there is a bug in vfio_dma_do_map, when checking that pages are not already mapped. Since the check is being done in a for loop nested within the main loop, breaking out of it does not create the intended behavior. If the underlying IOMMU driver returns a non-NULL value, this will be ignored and mapping the DMA range will be attempted anyway, leading to unpredictable behavior. This interacts badly with the ARM SMMU driver issue fixed in the patch that was submitted with the title: "[PATCH 2/2] ARM: SMMU: return NULL on error in arm_smmu_iova_to_phys" Both fixes are required in order to use the vfio_iommu_type1 driver with an ARM SMMU. This patch refactors the function slightly, in order to also make this kind of bug less likely. Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
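The class of bug described here is easy to reproduce in isolation. The sketch below is a simplified illustration and not the vfio_dma_do_map code: the function names, the chunked loop, and the two sample predicates are hypothetical. It contrasts a `break` that only leaves the inner loop with a refactored version that moves the check into a helper and returns an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lookup: true if page index `page` is already mapped. */
typedef bool (*is_mapped_fn)(size_t page);

/* Sample predicates used for illustration only. */
bool page_five_is_mapped(size_t page) { return page == 5; }
bool nothing_mapped(size_t page) { (void)page; return false; }

/* Buggy shape: returns the number of chunks "mapped". */
long map_range_buggy(size_t npages, size_t chunk, is_mapped_fn is_mapped)
{
    long mapped = 0;

    assert(chunk > 0);
    for (size_t start = 0; start < npages; start += chunk) {
        for (size_t p = start; p < start + chunk && p < npages; p++) {
            if (is_mapped(p))
                break;      /* leaves the inner loop only */
        }
        mapped++;           /* still runs after a conflict was found */
    }
    return mapped;
}

/* The check lives in its own function, so its result cannot be ignored. */
static bool range_is_free(size_t start, size_t end, is_mapped_fn is_mapped)
{
    for (size_t p = start; p < end; p++)
        if (is_mapped(p))
            return false;
    return true;
}

/* Fixed shape: returns chunks mapped, or -1 if any page was already mapped. */
long map_range_fixed(size_t npages, size_t chunk, is_mapped_fn is_mapped)
{
    long mapped = 0;

    assert(chunk > 0);
    for (size_t start = 0; start < npages; start += chunk) {
        size_t end = start + chunk < npages ? start + chunk : npages;

        if (!range_is_free(start, end, is_mapped))
            return -1;
        mapped++;
    }
    return mapped;
}
```

With eight pages in chunks of four and page 5 already mapped, the buggy version still reports both chunks as mapped, while the fixed version stops with an error at the second chunk.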
2013-09-04vfio-pci: PCI hot reset interfaceAlex Williamson1-1/+285
The current VFIO_DEVICE_RESET interface only maps to PCI use cases where we can isolate the reset to the individual PCI function. This means the device must support FLR (PCIe or AF), PM reset on D3hot->D0 transition, device specific reset, or be a singleton device on a bus for a secondary bus reset. FLR does not have widespread support, PM reset is not very reliable, and bus topology is dictated by the system and device design. We need to provide a means for a user to induce a bus reset in cases where the existing mechanisms are not available or not reliable. This device specific extension to VFIO provides the user with this ability. Two new ioctls are introduced: - VFIO_DEVICE_PCI_GET_HOT_RESET_INFO - VFIO_DEVICE_PCI_HOT_RESET The first provides the user with information about the extent of devices affected by a hot reset. This is essentially a list of devices and the IOMMU groups they belong to. The user may then initiate a hot reset by calling the second ioctl. We must be careful that the user has ownership of all the affected devices found via the first ioctl, so the second ioctl takes a list of file descriptors for the VFIO groups affected by the reset. Each group must have IOMMU protection established for the ioctl to succeed. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-09-04vfio-pci: Test for extended config spaceAlex Williamson1-3/+8
Having PCIe/PCI-X capability isn't enough to assume that there are extended capabilities. Both specs define that the first capability header is all zero if there are no extended capabilities. Testing for this avoids an erroneous message about hiding capability 0x0 at offset 0x100. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
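The test described above amounts to reading the first extended capability header and checking that it is not all zero. A minimal sketch, assuming the config space has already been copied into a little-endian byte buffer; `has_extended_caps()` and `EXT_CAP_START` are hypothetical names, not the vfio-pci implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of the first extended capability header (0x100 in the message). */
#define EXT_CAP_START 0x100u

/*
 * Return true only if the buffer reaches past the first extended
 * capability header and that 32-bit header is not all zero.
 */
bool has_extended_caps(const uint8_t *cfg, size_t len)
{
    uint32_t hdr;

    assert(cfg != NULL);
    if (len < EXT_CAP_START + 4)
        return false;   /* no extended config space in this buffer */

    /* Assemble the little-endian dword at offset 0x100. */
    hdr = (uint32_t)cfg[EXT_CAP_START] |
          ((uint32_t)cfg[EXT_CAP_START + 1] << 8) |
          ((uint32_t)cfg[EXT_CAP_START + 2] << 16) |
          ((uint32_t)cfg[EXT_CAP_START + 3] << 24);

    return hdr != 0;    /* all zero means "no extended capabilities" */
}
```

A 256-byte buffer, or a 4096-byte buffer whose dword at 0x100 is zero, both yield `false`, so no capability walk (and no message about capability 0x0) would be attempted.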
2013-08-28vfio-pci: Use fdget() rather than eventfd_fget()Alex Williamson1-19/+16
eventfd_fget() tests to see whether the file is an eventfd file, which we then immediately pass to eventfd_ctx_fileget(), which again tests whether the file is an eventfd file. Simplify slightly by using fdget() so that we only test that we're looking at an eventfd once. fget() could also be used, but fdget() makes use of fget_light() for another slight optimization. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-08-22vfio: Add O_CLOEXEC flag to vfio device fdAlex Williamson1-1/+1
Add the default O_CLOEXEC flag for device file descriptors. This is generally considered a safer option as it allows the user a race free option to decide whether file descriptors are inherited across exec, with the default avoiding file descriptor leaks. Reported-by: Yann Droneaud <ydroneaud@opteya.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-08-22vfio: use get_unused_fd_flags(0) instead of get_unused_fd()Yann Droneaud1-1/+1
Macro get_unused_fd() is used to allocate a file descriptor with default flags. Those default flags (0) can be "unsafe": O_CLOEXEC must be used by default to not leak file descriptors across exec(). Instead of macro get_unused_fd(), functions anon_inode_getfd() or get_unused_fd_flags() should be used with flags given by userspace. If not possible, flags should be set to O_CLOEXEC to provide userspace with a default safe behavior. In a further patch, get_unused_fd() will be removed so that new code starts using anon_inode_getfd() or get_unused_fd_flags() with correct flags. This patch replaces calls to get_unused_fd() with equivalent calls to get_unused_fd_flags(0) to preserve current behavior for existing code. The hard coded flag value (0) should be reviewed on a per-subsystem basis, and, if possible, set to O_CLOEXEC. Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Link: http://lkml.kernel.org/r/cover.1376327678.git.ydroneaud@opteya.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
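The difference between flags of 0 and O_CLOEXEC is visible from userspace. The sketch below is a userspace analogy rather than the kernel-side get_unused_fd_flags() call: it opens /dev/null with and without O_CLOEXEC and reads the close-on-exec bit back with fcntl(). The helper names `open_devnull()` and `fd_is_cloexec()` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L  /* make O_CLOEXEC visible in strict modes */

#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Open /dev/null read-only, OR-ing in any extra flags (0 or O_CLOEXEC). */
int open_devnull(int extra_flags)
{
    return open("/dev/null", O_RDONLY | extra_flags);
}

/*
 * Return 1 if the descriptor is closed automatically across exec(),
 * 0 if it would be inherited, and -1 if fcntl() fails.
 */
int fd_is_cloexec(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

A descriptor from `open_devnull(0)` would be inherited by any program the process later exec()s, which is the leak the message warns about; one from `open_devnull(O_CLOEXEC)` is closed atomically at exec() with no window in which another thread could fork and leak it.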
2013-08-05vfio: add external user supportAlexey Kardashevskiy1-0/+62
VFIO is designed to be used via ioctls on file descriptors returned by VFIO. However in some situations support for an external user is required. The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to use the existing VFIO groups for exclusive access in real/virtual mode on a host to avoid passing map/unmap requests to the user space, which would make things pretty slow. The protocol includes: 1. do normal VFIO init operation: - opening a new container; - attaching group(s) to it; - setting an IOMMU driver for a container. When IOMMU is set for a container, all groups in it are considered ready to use by an external user. 2. User space passes a group fd to an external user. The external user calls vfio_group_get_external_user() to verify that: - the group is initialized; - IOMMU is set for it. If both checks pass, vfio_group_get_external_user() increments the container user counter to prevent the VFIO group from disposal before KVM exits. 3. The external user calls vfio_external_user_iommu_id() to obtain an IOMMU ID. PPC64 KVM uses it to link logical bus number (LIOBN) with IOMMU ID. 4. When the external KVM finishes, it calls vfio_group_put_external_user() to release the VFIO group. This call decrements the container user counter. Everything gets released. The "vfio: Limit group opens" patch is also required for consistency. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-07-24vfio-pci: Avoid deadlock on removeAlex Williamson1-2/+21
If an attempt is made to unbind a device from vfio-pci while that device is in use, the request is blocked until the device becomes unused. Unfortunately, that unbind path still grabs the device_lock, which certain things like __pci_reset_function() also want to take. This means we need to try to acquire the locks ourselves and use the pre-locked version, __pci_reset_function_locked(). Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-07-24vfio: Ignore spurious notifiesAlex Williamson1-5/+3
Remove debugging WARN_ON if we get a spurious notify for a group that no longer exists. No reports of anyone hitting this, but it would likely be a race and not a bug if they did. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-07-24vfio: Don't overreact to DEL_DEVICEAlex Williamson1-22/+7
BUS_NOTIFY_DEL_DEVICE triggers IOMMU drivers to remove devices from their iommu group, but there's really nothing we can do about it at this point. If the device is in use, then the vfio sub-driver will block the device_del from completing until it's released. If the device is not in use or not owned by a vfio sub-driver, then we really don't care that it's being removed. The current code can be triggered just by unloading an sr-iov driver (ex. igb) while the VFs are attached to vfio-pci because it makes an incorrect assumption about the ordering of driver remove callbacks vs the DEL_DEVICE notification. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-07-10Merge tag 'vfio-v3.11' of git://github.com/awilliam/linux-vfioLinus Torvalds2-225/+415
Pull vfio updates from Alex Williamson: "Largely hugepage support for vfio/type1 iommu and surrounding cleanups and fixes" * tag 'vfio-v3.11' of git://github.com/awilliam/linux-vfio: vfio/type1: Fix leak on error path vfio: Limit group opens vfio/type1: Fix missed frees and zero sized removes vfio: fix documentation vfio: Provide module option to disable vfio_iommu_type1 hugepage support vfio: hugepage support for vfio_iommu_type1 vfio: Convert type1 iommu to use rbtree
2013-07-04Merge branch 'next' of ↵Linus Torvalds4-0/+385
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc updates from Ben Herrenschmidt: "These are the powerpc changes for the 3.11 merge window. In addition to the usual bug fixes and small updates, the main highlights are: - Support for transparent huge pages by Aneesh Kumar for 64-bit server processors. This allows the use of 16M pages as transparent huge pages on kernels compiled with a 64K base page size. - Base VFIO support for KVM on power by Alexey Kardashevskiy - Wiring up of our nvram to the pstore infrastructure, including putting compressed oopses in there by Aruna Balakrishnaiah - Move, rework and improve our "EEH" (basically PCI error handling and recovery) infrastructure. It is no longer specific to pseries but is now usable by the new "powernv" platform as well (no hypervisor) by Gavin Shan. - I fixed some bugs in our math-emu instruction decoding and made it usable to emulate some optional FP instructions on processors with hard FP that lack them (such as fsqrt on Freescale embedded processors). - Support for Power8 "Event Based Branch" facility by Michael Ellerman. This facility allows what is basically "userspace interrupts" for performance monitor events. - A bunch of Transactional Memory vs. Signals bug fixes and HW breakpoint/watchpoint fixes by Michael Neuling. And more ... I apologize in advance if I've failed to highlight something that somebody deemed worth it."
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits) pstore: Add hsize argument in write_buf call of pstore_ftrace_call powerpc/fsl: add MPIC timer wakeup support powerpc/mpic: create mpic subsystem object powerpc/mpic: add global timer support powerpc/mpic: add irq_set_wake support powerpc/85xx: enable coreint for all the 64bit boards powerpc/8xx: Erroneous double irq_eoi() on CPM IRQ in MPC8xx powerpc/fsl: Enable CONFIG_E1000E in mpc85xx_smp_defconfig powerpc/mpic: Add get_version API both for internal and external use powerpc: Handle both new style and old style reserve maps powerpc/hw_brk: Fix off by one error when validating DAWR region end powerpc/pseries: Support compression of oops text via pstore powerpc/pseries: Re-organise the oops compression code pstore: Pass header size in the pstore write callback powerpc/powernv: Fix iommu initialization again powerpc/pseries: Inform the hypervisor we are using EBB regs powerpc/perf: Add power8 EBB support powerpc/perf: Core EBB support for 64-bit book3s powerpc/perf: Drop MMCRA from thread_struct powerpc/perf: Don't enable if we have zero events ...
2013-07-01vfio/type1: Fix leak on error pathAlex Williamson1-5/+8
We also don't handle unpinning zero pages as an error on other exits so we can fix that inconsistency by rolling in the next conditional return. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2013-06-29vfio: remap_pfn_range() sets all those flags...Al Viro1-1/+0
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>