summaryrefslogtreecommitdiffstats
path: root/drivers/vhost/net.c
AgeCommit message (Collapse)AuthorFilesLines
2017-09-05vhost_net: correctly check tx avail during rx busy pollingJason Wang1-1/+6
We check tx avail through vhost_enable_notify() in the past which is wrong since it only checks whether or not guest has filled more available buffer since last avail idx synchronization which was just done by vhost_vq_avail_empty() before. What we really want is checking pending buffers in the avail ring. Fix this by calling vhost_vq_avail_empty() instead. This issue could be noticed by doing netperf TCP_RR benchmark as client from guest (but not host). With this fix, TCP_RR from guest to localhost restores from 1375.91 trans per sec to 55235.28 trans per sec on my laptop (Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz). Fixes: 030881372460 ("vhost_net: basic polling support") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01net: convert (struct ubuf_info)->refcnt to refcount_tEric Dumazet1-1/+1
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. v2: added the change in drivers/vhost/net.c as spotted by Willem. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03sock: enable MSG_ZEROCOPYWillem de Bruijn1-0/+1
Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with skb_zerocopy_clone() wherever needed due to skb split, merge, resize or clone. Split skb_orphan_frags into two variants. The split, merge, .. paths support reference counted zerocopy buffers, so do not do a deep copy. Add skb_orphan_frags_rx for paths that may loop packets to receive sockets. That is not allowed, as it may cause unbounded latency. Deep copy all zerocopy copy buffers, ref-counted or not, in this path. The exact locations to modify were chosen by exhaustively searching through all code that might modify skb_frag references and/or the the SKBTX_DEV_ZEROCOPY tx_flags bit. The changes err on the safe side, in two ways. (1) legacy ubuf_info paths virtio and tap are not modified. They keep a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags still call skb_copy_ubufs and thus copy frags in this case. (2) not all copies deep in the stack are addressed yet. skb_shift, skb_split and skb_try_coalesce can be refined to avoid copying. These are not in the hot path and this patch is hairy enough as is, so that is left for future refinement. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-12mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful ↵Michal Hocko1-1/+1
semantic __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to the page allocator. This has been true but only for allocations requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for smaller sizes. This is a bit unfortunate because there is no way to express the same semantic for those requests and they are considered too important to fail so they might end up looping in the page allocator for ever, similarly to GFP_NOFAIL requests. Now that the whole tree has been cleaned up and accidental or misled usage of __GFP_REPEAT flag has been removed for !costly requests we can give the original flag a better name and more importantly a more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that the allocator would try really hard but there is no promise of a success. This will work independent of the order and overrides the default allocator behavior. Page allocator users have several levels of guarantee vs. cost options (take GFP_KERNEL as an example) - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_ attempt to free memory at all. The most light weight mode which even doesn't kick the background reclaim. Should be used carefully because it might deplete the memory and the next user might hit the more aggressive reclaim - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic allocation without any attempt to free memory from the current context but can wake kswapd to reclaim memory if the zone is below the low watermark. Can be used from either atomic contexts or when the request is a performance optimization and there is another fallback for a slow path. - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bh context with an expensive slow path fallback. - GFP_KERNEL - both background and direct reclaim are allowed and the _default_ page allocator behavior is used. That means that !costly allocation requests are basically nofail but there is no guarantee of that behavior so failures have to be checked properly by callers (e.g. OOM killer victim is allowed to fail currently). - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and all allocation requests fail early rather than cause disruptive reclaim (one round of reclaim in this implementation). The OOM killer is not invoked. - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior and all allocation requests try really hard. The request will fail if the reclaim cannot make any progress. The OOM killer won't be triggered. - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior and all allocation requests will loop endlessly until they succeed. This might be really dangerous especially for larger orders. Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because they already had their semantic. No new users are added. __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if there is no progress and we have already passed the OOM point. This means that all the reclaim opportunities have been exhausted except the most disruptive one (the OOM killer) and a user defined fallback behavior is more sensible than keep retrying in the page allocator. [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c] [mhocko@suse.com: semantic fix] Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz [mhocko@kernel.org: address other thing spotted by Vlastimil] Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alex Belits <alex.belits@cavium.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David Daney <david.daney@cavium.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-18vhost_net: try batch dequing from skb arrayJason Wang1-6/+122
We used to dequeue one skb during recvmsg() from skb_array, this could be inefficient because of the bad cache utilization and spinlock touching for each packet. This patch tries to batch them by calling batch dequeuing helpers explicitly on the exported skb array and pass the skb back through msg_control for underlayer socket to finish the userspace copying. Batch dequeuing is also the requirement for more batching improvement on receive path. Tests were done by pktgen on tap with XDP1 in guest. Host is Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz. rx batch | pps 0 2.25Mpps 1 2.33Mpps (+3.56%) 4 2.33Mpps (+3.56%) 16 2.35Mpps (+4.44%) 64 2.42Mpps (+7.56%) <- Default rx batching 128 2.40Mpps (+6.67%) 256 2.38Mpps (+5.78%) Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-08mm: support __GFP_REPEAT in kvmalloc_node for >32kBMichal Hocko1-6/+3
vhost code uses __GFP_REPEAT when allocating vhost_virtqueue resp. vhost_vsock because it would really like to prefer kmalloc to the vmalloc fallback - see 23cc5a991c7a ("vhost-net: extend device allocation to vmalloc") for more context. Michael Tsirkin has also noted: "__GFP_REPEAT overhead is during allocation time. Using vmalloc means all accesses are slowed down. Allocation is not on data path, accesses are." The similar applies to other vhost_kvzalloc users. Let's teach kvmalloc_node to handle __GFP_REPEAT properly. There are two things to be careful about. First we should prevent from the OOM killer and so have to involve __GFP_NORETRY by default and secondly override __GFP_REPEAT for !costly order requests as the __GFP_REPEAT is ignored for !costly orders. Supporting __GFP_REPEAT like semantic for !costly request is possible it would require changes in the page allocator. This is out of scope of this patch. This patch shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/20170306103032.2540-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-02sched/headers: Prepare to move signal wakeup & sigpending methods from ↵Ingo Molnar1-0/+1
<linux/sched.h> into <linux/sched/signal.h> Fix up affected files that include this signal functionality via sched.h. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar1-0/+1
<linux/sched/clock.h> We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which will have to be picked up from other headers and .c files. Create a trivial placeholder <linux/sched/clock.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-11tap: Renaming tap related APIs, data structures, macrosSainath Grandhi1-1/+2
Renaming tap related APIs, data structures and macros in tap.c from macvtap_.* to tap_.* Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18vhost_net: tx batchingJason Wang1-3/+20
This patch tries to utilize tuntap rx batching by peeking the tx virtqueue during transmission, if there's more available buffers in the virtqueue, set MSG_MORE flag for a hint for backend (e.g tuntap) to batch the packets. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-16locking/core: Remove cpu_relax_lowlatency() usersChristian Borntraeger1-2/+2
With the s390 special case of a yielding cpu_relax() implementation gone, we can now remove all users of cpu_relax_lowlatency() and replace them with cpu_relax(). Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Noam Camus <noamc@ezchip.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linuxppc-dev@lists.ozlabs.org Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1477386195-32736-5-git-send-email-borntraeger@de.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-06Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds1-10/+57
Pull virtio/vhost updates from Michael Tsirkin: - new vsock device support in host and guest - platform IOMMU support in host and guest, including compatibility quirks for legacy systems. - misc fixes and cleanups. * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: VSOCK: Use kvfree() vhost: split out vringh Kconfig vhost: detect 32 bit integer wrap around vhost: new device IOTLB API vhost: drop vringh dependency vhost: convert pre sorted vhost memory array to interval tree vhost: introduce vhost memory accessors VSOCK: Add Makefile and Kconfig VSOCK: Introduce vhost_vsock.ko VSOCK: Introduce virtio_transport.ko VSOCK: Introduce virtio_vsock_common.ko VSOCK: defer sock removal to transports VSOCK: transport-specific vsock_transport functions vhost: drop vringh dependency vop: pull in vhost Kconfig virtio: new feature to detect IOMMU device quirk balloon: check the number of available pages in leak balloon vhost: lockless enqueuing vhost: simplify work flushing
2016-08-02vhost: new device IOTLB APIJason Wang1-6/+53
This patch tries to implement an device IOTLB for vhost. This could be used with userspace(qemu) implementation of DMA remapping to emulate an IOMMU for the guest. The idea is simple, cache the translation in a software device IOTLB (which is implemented as an interval tree) in vhost and use vhost_net file descriptor for reporting IOTLB miss and IOTLB update/invalidation. When vhost meets an IOTLB miss, the fault address, size and access can be read from the file. After userspace finishes the translation, it writes the translated address to the vhost_net file to update the device IOTLB. When device IOTLB is enabled by setting VIRTIO_F_IOMMU_PLATFORM all vq addresses set by ioctl are treated as iova instead of virtual address and the accessing can only be done through IOTLB instead of direct userspace memory access. Before each round or vq processing, all vq metadata is prefetched in device IOTLB to make sure no translation fault happens during vq processing. In most cases, virtqueues are contiguous even in virtual address space. The IOTLB translation for virtqueue itself may make it a little slower. We might add fast path cache on top of this patch. Signed-off-by: Jason Wang <jasowang@redhat.com> [mst: use virtio feature bit: VHOST_F_DEVICE_IOTLB -> VIRTIO_F_IOMMU_PLATFORM ] [mst: fix build warnings ] Signed-off-by: Michael S. Tsirkin <mst@redhat.com> [ weiyj.lk: missing unlock on error ] Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
2016-08-02vhost: convert pre sorted vhost memory array to interval treeJason Wang1-4/+4
Current pre-sorted memory region array has some limitations for future device IOTLB conversion: 1) need extra work for adding and removing a single region, and it's expected to be slow because of sorting or memory re-allocation. 2) need extra work of removing a large range which may intersect several regions with different size. 3) need trick for a replacement policy like LRU To overcome the above shortcomings, this patch convert it to interval tree which can easily address the above issue with almost no extra work. The patch could be used for: - Extend the current API and only let the userspace to send diffs of memory table. - Simplify Device IOTLB implementation. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-07-01tun: switch to use skb array for txJason Wang1-1/+15
We used to queue tx packets in sk_receive_queue, this is less efficient since it requires spinlocks to synchronize between producer and consumer. This patch tries to address this by: - switch from sk_receive_queue to a skb_array, and resize it when tx_queue_len was changed. - introduce a new proto_ops peek_len which was used for peeking the skb length. - implement a tun version of peek_len for vhost_net to use and convert vhost_net to use peek_len if possible. Pktgen test shows about 15.3% improvement on guest receiving pps for small buffers: Before: ~1300000pps After : ~1500000pps Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07vhost_net: stop polling socket during rx processingJason Wang1-31/+33
We don't stop rx polling socket during rx processing, this will lead unnecessary wakeups from under layer net devices (E.g sock_def_readable() form tun). Rx will be slowed down in this way. This patch avoids this by stop polling socket during rx processing. A small drawback is that this introduces some overheads in light load case because of the extra start/stop polling, but single netperf TCP_RR does not notice any change. In a super heavy load case, e.g using pktgen to inject packet to guest, we get about ~8.8% improvement on pps: before: ~1240000 pkt/s after: ~1350000 pkt/s Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-11vhost_net: basic polling supportJason Wang1-5/+73
This patch tries to poll for new added tx buffer or socket receive queue for a while at the end of tx/rx processing. The maximum time spent on polling were specified through a new kind of vring ioctl. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-03-02vhost: rename vhost_init_used()Greg Kurz1-1/+1
Looking at how callers use this, maybe we should just rename init_used to vhost_vq_init_access. The _used suffix was a hint that we access the vq used ring. But maybe what callers care about is that it must be called after access_ok. Also, this function manipulates the vq->is_le field which isn't related to the vq used ring. This patch simply renames vhost_init_used() to vhost_vq_init_access() as suggested by Michael. No behaviour change. Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2015-09-16vhost: move features to coreMichael S. Tsirkin1-2/+1
virtio 1 and any layout are core features, move them there. This fixes vhost test. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2015-04-11new helper: msg_data_left()Al Viro1-2/+2
convert open-coded instances Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-11/+14
Conflicts: drivers/net/ethernet/rocker/rocker.c The rocker commit was two overlapping changes, one to rename the ->vport member to ->pport, and another making the bitmask expression use '1ULL' instead of plain '1'. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02net: Remove iocb argument from sendmsg and recvmsgYing Xue1-3/+3
After TIPC doesn't depend on iocb argument in its internal implementations of sendmsg() and recvmsg() hooks defined in proto structure, no any user is using iocb argument in them at all now. Then we can drop the redundant iocb argument completely from kinds of implementations of both sendmsg() and recvmsg() in the entire networking stack. Cc: Christoph Hellwig <hch@lst.de> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-27vhost: drop hard-coded num_buffers sizeMichael S. Tsirkin1-1/+2
The 2 that we use for copy_to_iter comes from sizeof(u16), it used to be that way before the iov iter update. Fix it up, making it obvious the size of stack access is right. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-27vhost: cleanup iterator update logicMichael S. Tsirkin1-10/+12
Recent iterator-related changes in vhost made it harder to follow the logic fixing up the header. In fact, the fixup always happens at the same offset: sizeof(virtio_net_hdr): sometimes the fixup iterator is updated by copy_to_iter, sometimes-by iov_iter_advance. Rearrange code to make this obvious. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-15vhost_net: fix wrong iter offset when setting number of buffersJason Wang1-5/+6
In commit ba7438aed924 ("vhost: don't bother copying iovecs in handle_rx(), kill memcpy_toiovecend()"), we advance iov iter fixup sizeof(struct virtio_net_hdr) bytes and fill the number of buffers after doing the socket recvmsg(). This work well but was broken after commit 6e03f896b52c ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net") which tries to advance sizeof(struct virtio_net_hdr_mrg_rxbuf). It will fill the number of buffers at the wrong place. This patch fixes this. Fixes 6e03f896b52c ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net") Cc: David S. Miller <davem@davemloft.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-7/+7
Conflicts: drivers/net/vxlan.c drivers/vhost/net.c include/linux/if_vlan.h net/core/dev.c The net/core/dev.c conflict was the overlap of one commit marking an existing function static whilst another was adding a new function. In the include/linux/if_vlan.h case, the type used for a local variable was changed in 'net', whereas the function got rewritten to fix a stacked vlan bug in 'net-next'. In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next' overlapped with an endainness fix for VHOST 1.0 in 'net'. In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter in 'net-next' whereas in 'net' there was a bug fix to pass in the correct network namespace pointer in calls to this function. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04vhost/net: fix up num_buffers endian-nessMichael S. Tsirkin1-1/+3
In virtio 1.0 mode, when mergeable buffers are enabled on a big-endian host, num_buffers wasn't byte-swapped correctly, so large incoming packets got corrupted. To fix, fill it in within hdr - this also makes sure it gets the correct type. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04vhost: don't bother copying iovecs in handle_rx(), kill memcpy_toiovecend()Al Viro1-59/+23
Cc: Michael S. Tsirkin <mst@redhat.com> Cc: kvm@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-04vhost: don't bother with copying iovec in handle_tx()Al Viro1-4/+5
just advance the msg.msg_iter and be done with that. Cc: Michael S. Tsirkin <mst@redhat.com> Cc: kvm@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-01-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+1
Conflicts: drivers/net/xen-netfront.c Minor overlapping changes in xen-netfront.c, mostly to do with some buffer management changes alongside the split of stats into TX and RX. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-13net: rename vlan_tx_* helpers since "tx" is misleading thereJiri Pirko1-1/+1
The same macros are used for rx as well. So rename it. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-07vhost/net: length miscalculationMichael S. Tsirkin1-1/+1
commit 8b38694a2dc8b18374310df50174f1e4376d6824 vhost/net: virtio 1.0 byte swap had this chunk: - heads[headcount - 1].len += datalen; + heads[headcount - 1].len = cpu_to_vhost32(vq, len - datalen); This adds datalen with the wrong sign, causing guest panics. Fixes: 8b38694a2dc8b18374310df50174f1e4376d6824 Reported-by: Alex Williamson <alex.williamson@redhat.com> Suggested-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2014-12-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds1-5/+3
Pull networking updates from David Miller: 1) New offloading infrastructure and example 'rocker' driver for offloading of switching and routing to hardware. This work was done by a large group of dedicated individuals, not limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend, Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu 2) Start making the networking operate on IOV iterators instead of modifying iov objects in-situ during transfers. Thanks to Al Viro and Herbert Xu. 3) A set of new netlink interfaces for the TIPC stack, from Richard Alpe. 4) Remove unnecessary looping during ipv6 routing lookups, from Martin KaFai Lau. 5) Add PAUSE frame generation support to gianfar driver, from Matei Pavaluca. 6) Allow for larger reordering levels in TCP, which are easily achievable in the real world right now, from Eric Dumazet. 7) Add a variable of napi_schedule that doesn't need to disable cpu interrupts, from Eric Dumazet. 8) Use a doubly linked list to optimize neigh_parms_release(), from Nicolas Dichtel. 9) Various enhancements to the kernel BPF verifier, and allow eBPF programs to actually be attached to sockets. From Alexei Starovoitov. 10) Support TSO/LSO in sunvnet driver, from David L Stevens. 11) Allow controlling ECN usage via routing metrics, from Florian Westphal. 12) Remote checksum offload, from Tom Herbert. 13) Add split-header receive, BQL, and xmit_more support to amd-xgbe driver, from Thomas Lendacky. 14) Add MPLS support to openvswitch, from Simon Horman. 15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen Klassert. 16) Do gro flushes on a per-device basis using a timer, from Eric Dumazet. This tries to resolve the conflicting goals between the desired handling of bulk vs. RPC-like traffic. 17) Allow userspace to ask for the CPU upon what a packet was received/steered, via SO_INCOMING_CPU. From Eric Dumazet. 18) Limit GSO packets to half the current congestion window, from Eric Dumazet. 19) Add a generic helper so that all drivers set their RSS keys in a consistent way, from Eric Dumazet. 20) Add xmit_more support to enic driver, from Govindarajulu Varadarajan. 21) Add VLAN packet scheduler action, from Jiri Pirko. 22) Support configurable RSS hash functions via ethtool, from Eyal Perry. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits) Fix race condition between vxlan_sock_add and vxlan_sock_release net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header net/mlx4: Add support for A0 steering net/mlx4: Refactor QUERY_PORT net/mlx4_core: Add explicit error message when rule doesn't meet configuration net/mlx4: Add A0 hybrid steering net/mlx4: Add mlx4_bitmap zone allocator net/mlx4: Add a check if there are too many reserved QPs net/mlx4: Change QP allocation scheme net/mlx4_core: Use tasklet for user-space CQ completion events net/mlx4_core: Mask out host side virtualization features for guests net/mlx4_en: Set csum level for encapsulated packets be2net: Export tunnel offloads only when a VxLAN tunnel is created gianfar: Fix dma check map error when DMA_API_DEBUG is enabled cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call net: fec: only enable mdio interrupt before phy device link up net: fec: clear all interrupt events to support i.MX6SX net: fec: reset fep link status in suspend function net: sock: fix access via invalid file descriptor net: introduce helper macro for_each_cmsghdr ...
2014-12-09put iov_iter into msghdrAl Viro1-5/+3
Note that the code _using_ ->msg_iter at that point will be very unhappy with anything other than unshifted iovec-backed iov_iter. We still need to convert users to proper primitives. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-09vhost/net: enable virtio 1.0Michael S. Tsirkin1-1/+2
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2014-12-09vhost/net: larger header for virtio 1.0Michael S. Tsirkin1-1/+2
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
2014-12-09vhost/net: virtio 1.0 byte swapMichael S. Tsirkin1-5/+10
I had to add an explicit tag to suppress compiler warning: gcc isn't smart enough to notice that len is always initialized since function is called with size > 0. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
2014-12-09vhost/net: force len for TX to host endianMichael S. Tsirkin1-5/+5
vhost/net keeps a copy of the used ring in host memory but (ab)uses the length field for internal house-keeping. This works because the length in the used ring for tx is always 0. In order to suppress sparse warnings, we force native endianness here. Note that these values are never exposed to guests. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Jason Wang <jasowang@redhat.com>
2014-06-23vhost-net: don't open-code kvfreeRomain Francoise1-10/+2
Commit 23cc5a991c ("vhost-net: extend device allocation to vmalloc") added another open-coded version of kvfree (which is available since v3.15-rc5), nuke it. Signed-off-by: Romain Francoise <romain@orebokech.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2014-06-09vhost: move memory pointer to VQsMichael S. Tsirkin1-2/+2
commit 2ae76693b8bcabf370b981cd00c36cd41d33fabc vhost: replace rcu with mutex replaced rcu sync for memory accesses with VQ mutex locl/unlock. This is correct since all accesses are under VQ mutex, but incomplete: we still do useless rcu lock/unlock operations, someone might copy this code into some other context where this won't be right. This use of RCU is also non standard and hard to understand. Let's copy the pointer to each VQ structure, this way the access rules become straight-forward, and there's no need for RCU anymore. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2014-06-09vhost: move acked_features to VQsMichael S. Tsirkin1-5/+3
Refactor code to make sure features are only accessed under VQ mutex. This makes everything simpler, no need for RCU here anymore. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2014-06-09vhost-net: extend device allocation to vmallocMichael S. Tsirkin1-5/+18
Michael Mueller provided a patch to reduce the size of vhost-net structure as some allocations could fail under memory pressure/fragmentation. We are still left with high order allocations though. This patch is handling the problem at the core level, allowing vhost structures to use vmalloc() if kmalloc() failed. As vmalloc() adds overhead on a critical network path, add __GFP_REPEAT to kzalloc() flags to do this fallback only when really needed. People are still looking at cleaner ways to handle the problem at the API level, probably passing in multiple iovecs. This hack seems consistent with approaches taken since then by drivers/vhost/scsi.c and net/core/dev.c Based on patch by Romain Francoise. Cc: Michael Mueller <mimu@linux.vnet.ibm.com> Signed-off-by: Romain Francoise <romain@orebokech.com> Acked-by: Michael S. Tsirkin <mst@redhat.com>
2014-04-01vhost: don't open-code sockfd_put()Al Viro1-7/+7
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-28vhost: validate vhost_get_vq_desc return valueMichael S. Tsirkin1-1/+5
vhost fails to validate negative error code from vhost_get_vq_desc causing a crash: we are using -EFAULT which is 0xfffffff2 as vector size, which exceeds the allocated size. The code in question was introduced in commit 8dd014adfea6f173c1ef6378f7e5e7924866c923 vhost-net: mergeable buffers support CVE-2014-0055 Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28vhost: fix total length when packets are too shortMichael S. Tsirkin1-0/+14
When mergeable buffers are disabled, and the incoming packet is too large for the rx buffer, get_rx_bufs returns success. This was intentional in order for make recvmsg truncate the packet and then handle_rx would detect err != sock_len and drop it. Unfortunately we pass the original sock_len to recvmsg - which means we use parts of iov not fully validated. Fix this up by detecting this overrun and doing packet drop immediately. CVE-2014-0077 Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13vhost: fix a theoretical race in device cleanupMichael S. Tsirkin1-0/+6
vhost_zerocopy_callback accesses VQ right after it drops a ubuf reference. In theory, this could race with device removal which waits on the ubuf kref, and crash on use after free. Do all accesses within rcu read side critical section, and synchronize on release. Since callbacks are always invoked from bh, synchronize_rcu_bh seems enough and will help release complete a bit faster. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13vhost: fix ref cnt checking deadlockMichael S. Tsirkin1-21/+20
vhost checked the counter within the refcnt before decrementing. It really wanted to know that it is the one that has the last reference, as a way to batch freeing resources a bit more efficiently. Note: we only let refcount go to 0 on device release. This works well but we now access the ref counter twice so there's a race: all users might see a high count and decide to defer freeing resources. In the end no one initiates freeing resources until the last reference is gone (which is on VM shotdown so might happen after a looooong time). Let's do what we probably should have done straight away: switch from kref to plain atomic, documenting the semantics, return the refcount value atomically after decrement, then use that to avoid the deadlock. Reported-by: Qin Chuanyu <qinchuanyu@huawei.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06vhost: remove the dead branchZhi Yong Wu1-7/+2
Since vhost_dev_init() forever return 0, some branches are never run, therefore need to be removed. Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-03vhost_net: correctly limit the max pending buffersJason Wang1-11/+7
As Michael point out, We used to limit the max pending DMAs to get better cache utilization. But it was not done correctly since it was one done when there's no new buffers submitted from guest. Guest can easily exceeds the limitation by keeping sending packets. So this patch moves the check into main loop. Tests shows about 5%-10% improvement on per cpu throughput for guest tx. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-03vhost_net: poll vhost queue after marking DMA is doneJason Wang1-4/+5
We used to poll vhost queue before making DMA is done, this is racy if vhost thread were waked up before marking DMA is done which can result the signal to be missed. Fix this by always polling the vhost thread before DMA is done. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>