summaryrefslogtreecommitdiffstats
path: root/net/core
AgeCommit message (Collapse)AuthorFilesLines
2010-09-27tcp: Fix >4GB writes on 64-bit.David S. Miller1-2/+3
Fixes kernel bugzilla #16603 tcp_sendmsg() truncates iov_len to an 'int' which a 4GB write to write zero bytes, for example. There is also the problem higher up of how verify_iovec() works. It wants to prevent the total length from looking like an error return value. However it does this using 'int', but syscalls return 'long' (and thus signed 64-bit on 64-bit machines). So it could trigger false-positives on 64-bit as written. So fix it to use 'long'. Reported-by: Olaf Bonorden <bono@onlinehome.de> Reported-by: Daniel Büse <dbuese@gmx.de> Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-24net: fix a lockdep splatEric Dumazet1-4/+4
We have for each socket : One spinlock (sk_slock.slock) One rwlock (sk_callback_lock) Possible scenarios are : (A) (this is used in net/sunrpc/xprtsock.c) read_lock(&sk->sk_callback_lock) (without blocking BH) <BH> spin_lock(&sk->sk_slock.slock); ... read_lock(&sk->sk_callback_lock); ... (B) write_lock_bh(&sk->sk_callback_lock) stuff write_unlock_bh(&sk->sk_callback_lock) (C) spin_lock_bh(&sk->sk_slock) ... write_lock_bh(&sk->sk_callback_lock) stuff write_unlock_bh(&sk->sk_callback_lock) spin_unlock_bh(&sk->sk_slock) This (C) case conflicts with (A) : CPU1 [A] CPU2 [C] read_lock(callback_lock) <BH> spin_lock_bh(slock) <wait to spin_lock(slock)> <wait to write_lock_bh(callback_lock)> We have one problematic (C) use case in inet_csk_listen_stop() : local_bh_disable(); bh_lock_sock(child); // spin_lock_bh(&sk->sk_slock) WARN_ON(sock_owned_by_user(child)); ... sock_orphan(child); // write_lock_bh(&sk->sk_callback_lock) lockdep is not happy with this, as reported by Tetsuo Handa It seems only way to deal with this is to use read_lock_bh(callbacklock) everywhere. Thanks to Jarek for pointing a bug in my first attempt and suggesting this solution. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Jarek Poplawski <jarkao2@gmail.com> Tested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-14net: use rcu_barrier() in rollback_registered_manyEric Dumazet1-1/+1
netdev_wait_allrefs() waits that all references to a device vanishes. It currently uses a _very_ pessimistic 250 ms delay between each probe. Some users reported that no more than 4 devices can be dismantled per second, this is a pretty serious problem for some setups. Most of the time, a refcount is about to be released by an RCU callback, that is still in flight because rollback_registered_many() uses a synchronize_rcu() call instead of rcu_barrier(). Problem is visible if number of online cpus is one, because synchronize_rcu() is then a no op. time to remove 50 ipip tunnels on a UP machine : before patch : real 11.910s after patch : real 1.250s Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reported-by: Octavian Purdila <opurdila@ixiacom.com> Reported-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-08gro: Re-fix different skb headroomsJarek Poplawski1-1/+1
The patch: "gro: fix different skb headrooms" in its part: "2) allocate a minimal skb for head of frag_list" is buggy. The copied skb has p->data set at the ip header at the moment, and skb_gro_offset is the length of ip + tcp headers. So, after the change the length of mac header is skipped. Later skb_set_mac_header() sets it into the NET_SKB_PAD area (if it's long enough) and ip header is misaligned at NET_SKB_PAD + NET_IP_ALIGN offset. There is no reason to assume the original skb was wrongly allocated, so let's copy it as it was. bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=16626 fixes commit: 3d3be4333fdf6faa080947b331a6a19bce1a4f57 Reported-by: Plamen Petrov <pvp-lsts@fs.uni-ruse.bg> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> CC: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Plamen Petrov <pvp-lsts@fs.uni-ruse.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-07net: fix tx queue selection for bridged devices implementing select_queueHelmut Schaa1-8/+8
When a net device is implementing the select_queue callback and is part of a bridge, frames coming from the bridge already have a tx queue associated to the socket (introduced in commit a4ee3ce3293dc931fab19beb472a8bde1295aebe, "net: Use sk_tx_queue_mapping for connected sockets"). The call to sk_tx_queue_get will then return the tx queue used by the bridge instead of calling the select_queue callback. In case of mac80211 this broke QoS which is implemented by using the select_queue callback. Furthermore it introduced problems with rt2x00 because frames with the same TID and RA sometimes appeared on different tx queues which the hw cannot handle correctly. Fix this by always calling select_queue first if it is available and only afterwards use the socket tx queue mapping. Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-02pkt_sched: Fix lockdep warning on est_tree_lock in gen_estimatorJarek Poplawski1-6/+6
This patch fixes a lockdep warning: [ 516.287584] ========================================================= [ 516.288386] [ INFO: possible irq lock inversion dependency detected ] [ 516.288386] 2.6.35b #7 [ 516.288386] --------------------------------------------------------- [ 516.288386] swapper/0 just changed the state of lock: [ 516.288386] (&qdisc_tx_lock){+.-...}, at: [<c12eacda>] est_timer+0x62/0x1b4 [ 516.288386] but this lock took another, SOFTIRQ-unsafe lock in the past: [ 516.288386] (est_tree_lock){+.+...} [ 516.288386] [ 516.288386] and interrupts could create inverse lock ordering between them. ... So, est_tree_lock needs BH protection because it's taken by qdisc_tx_lock, which is used both in BH and process contexts. (Full warning with this patch at netdev, 02 Sep 2010.) Fixes commit: ae638c47dc040b8def16d05dc6acdd527628f231 ("pkt_sched: gen_estimator: add a new lock") Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-01gro: fix different skb headroomsEric Dumazet1-2/+6
Packets entering GRO might have different headrooms, even for a given flow (because of implementation details in drivers, like copybreak). We cant force drivers to deliver packets with a fixed headroom. 1) fix skb_segment() skb_segment() makes the false assumption headrooms of fragments are same than the head. When CHECKSUM_PARTIAL is used, this can give csum_start errors, and crash later in skb_copy_and_csum_dev() 2) allocate a minimal skb for head of frag_list skb_gro_receive() uses netdev_alloc_skb(headroom + skb_gro_offset(p)) to allocate a fresh skb. This adds NET_SKB_PAD to a padding already provided by netdevice, depending on various things, like copybreak. Use alloc_skb() to allocate an exact padding, to reduce cache line needs: NET_SKB_PAD + NET_IP_ALIGN bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=16626 Many thanks to Plamen Petrov, testing many debugging patches ! With help of Jarek Poplawski. Reported-by: Plamen Petrov <pvp-lsts@fs.uni-ruse.bg> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-17net: Fix a memmove bug in dev_gro_receive()Jarek Poplawski1-1/+1
>Xin Xiaohui wrote: > I looked into the code dev_gro_receive(), found the code here: > if the frags[0] is pulled to 0, then the page will be released, > and memmove() frags left. > Is that right? I'm not sure if memmove do right or not, but > frags[0].size is never set after memove at least. what I think > a simple way is not to do anything if we found frags[0].size == 0. > The patch is as followed. ... This version of the patch fixes the bug directly in memmove. Reported-by: "Xin, Xiaohui" <xiaohui.xin@intel.com> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-07net: disable preemption before call smp_processor_id()Changli Gao1-0/+2
Although netif_rx() isn't expected to be called in process context with preemption enabled, it'd better handle this case. And this is why get_cpu() is used in the non-RPS #ifdef branch. If tree RCU is selected, rcu_read_lock() won't disable preemption, so preempt_disable() should be called explictly. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-05net: Fix napi_gro_frags vs netpoll pathJarek Poplawski1-4/+1
The netpoll_rx_on() check in __napi_gro_receive() skips part of the "common" GRO_NORMAL path, especially "pull:" in dev_gro_receive(), where at least eth header should be copied for entirely paged skbs. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-03Revert "net: remove zap_completion_queue"David S. Miller2-3/+32
This reverts commit 15e83ed78864d0625e87a85f09b297c0919a4797. As explained by Johannes Berg, the optimization made here is invalid. Or, at best, incomplete. Not only destructor invocation, but conntract entry releasing must be executed outside of hw IRQ context. So just checking "skb->destructor" is insufficient. Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-02net: cleanup inclusionChangli Gao1-2/+0
Commit ab95bfe01f9872459c8678572ccadbf646badad0 replaces bridge and macvlan hooks in __netif_receive_skb(), so dev.c doesn't need to include their headers. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-01net: ingress filter message limitStephen Hemminger1-4/+4
If user misconfigures ingress and causes a redirection loop, don't overwhelm the log. This is also a error case so make it unlikely. Found by inspection, luckily not in real system. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-27Merge branch 'master' of ↵David S. Miller2-2/+6
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/bnx2x_main.c Merge bnx2x bug fixes in by hand... :-/ Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-26drop_monitor: use genl_register_family_with_ops()Changli Gao1-16/+7
[ Fix unused local variable build warnings. -DaveM ] Signed-off-by: Changli Gao <xiaosuo@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-25net: dev_forward_skb should call nf_resetBen Greear1-0/+1
With conn-track zones and probably with different network namespaces, the netfilter logic needs to be re-calculated on packet receive. If the netfilter logic is not reset, it will not be recalculated properly. This patch adds the nf_reset logic to dev_forward_skb. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-24net: pskb_expand_head() optimizationEric Dumazet1-1/+1
Move frags[] at the end of struct skb_shared_info, and make pskb_expand_head() copy only the used part of it instead of whole array. This should avoid kmemcheck warnings and speedup pskb_expand_head() as well, avoiding a lot of cache misses. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-24sysfs: add attribute to indicate hw address assignment typeStefan Assmann1-0/+2
Add addr_assign_type to struct net_device and expose it via sysfs. This new attribute has the purpose of giving user-space the ability to distinguish between different assignment types of MAC addresses. For example user-space can treat NICs with randomly generated MAC addresses differently than NICs that have permanent (locally assigned) MAC addresses. For the former udev could write a persistent net rule by matching the device path instead of the MAC address. There's also the case of devices that 'steal' MAC addresses from slave devices. In which it is also be beneficial for user-space to be aware of the fact. This patch also introduces a helper function to assist adoption of drivers that generate MAC addresses randomly. Signed-off-by: Stefan Assmann <sassmann@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-23net: core: don't use own hex_to_bin() methodAndy Shevchenko1-24/+12
Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-22net: Fix skb_copy_expand() handling of ->csum_startDavid S. Miller1-1/+2
It should only be adjusted if ip_summed == CHECKSUM_PARTIAL. Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-22net: Fix corruption of skb csum field in pskb_expand_head() of net/core/skbuff.cAndrea Shepard1-1/+3
Make pskb_expand_head() check ip_summed to make sure csum_start is really csum_start and not csum before adjusting it. This fixes a bug I encountered using a Sun Quad-Fast Ethernet card and VLANs. On my configuration, the sunhme driver produces skbs with differing amounts of headroom on receive depending on the packet size. See line 2030 of drivers/net/sunhme.c; packets smaller than RX_COPY_THRESHOLD have 52 bytes of headroom but packets larger than that cutoff have only 20 bytes. When these packets reach the VLAN driver, vlan_check_reorder_header() calls skb_cow(), which, if the packet has less than NET_SKB_PAD (== 32) bytes of headroom, uses pskb_expand_head() to make more. Then, pskb_expand_head() needs to adjust a lot of offsets into the skb, including csum_start. Since csum_start is a union with csum, if the packet has a valid csum value this will corrupt it, which was the effect I observed. The sunhme hardware computes receive checksums, so the skbs would be created by the driver with ip_summed == CHECKSUM_COMPLETE and a valid csum field, and then pskb_expand_head() would corrupt the csum field, leading to an "hw csum error" message later on, for example in icmp_rcv() for pings larger than the sunhme RX_COPY_THRESHOLD. On the basis of the comment at the beginning of include/linux/skbuff.h, I believe that the csum_start skb field is only meaningful if ip_csummed is CSUM_PARTIAL, so this patch makes pskb_expand_head() adjust it only in that case to avoid corrupting a valid csum value. Please see my more in-depth disucssion of tracking down this bug for more details if you like: http://puellavulnerata.livejournal.com/112186.html http://puellavulnerata.livejournal.com/112567.html http://puellavulnerata.livejournal.com/112891.html http://puellavulnerata.livejournal.com/113096.html http://puellavulnerata.livejournal.com/113591.html I am not subscribed to this list, so please CC me on replies. Signed-off-by: Andrea Shepard <andrea@persephoneslair.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-20Merge branch 'master' of ↵David S. Miller2-8/+17
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/vhost/net.c net/bridge/br_device.c Fix merge conflict in drivers/vhost/net.c with guidance from Stephen Rothwell. Revert the effects of net-2.6 commit 573201f36fd9c7c6d5218cdcd9948cee700b277d since net-next-2.6 has fixes that make bridge netpoll work properly thus we don't need it disabled. Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-20drop_monitor: Add error code to detect duplicate state changesNeil Horman1-2/+8
Patch to add -EAGAIN error to dropwatch netlink message handling code. -EAGAIN will be returned anytime userspace attempts to transition the state of the drop monitor service to a state that its already in. That allows user space to detect this condition, so it doesn't wait for a success ACK that will never arrive. Tested successfully by me Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-20__dst_free(): put EXPORT_SYMBOLS after the fctNicolas Dichtel1-1/+1
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-19net: this_cpu_xxx conversionsEric Dumazet1-3/+2
Use modern this_cpu_xxx() api, saving few bytes on x86 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-19net: 64bit stats for netdev_queueEric Dumazet1-1/+3
Since struct netdev_queue tx_bytes/tx_packets/tx_dropped are already protected by _xmit_lock, its easy to convert these fields to u64 instead of unsigned long. This completes 64bit stats for devices using them (vlan, macvlan, ...) Strictly, we could avoid the locking in dev_txq_stats_fold() on 64bit arches, but its slow path and we prefer keep it simple. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-18net: support time stamping in phy devices.Richard Cochran3-1/+130
This patch adds a new networking option to allow hardware time stamps from PHY devices. When enabled, likely candidates among incoming and outgoing network packets are offered to the PHY driver for possible time stamping. When accepted by the PHY driver, incoming packets are deferred for later delivery by the driver. The patch also adds phylib driver methods for the SIOCSHWTSTAMP ioctl and callbacks for transmit and receive time stamping. Drivers may optionally implement these functions. Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-14net: fix problem in reading sock TX queueTom Herbert1-4/+3
Fix problem in reading the tx_queue recorded in a socket. In dev_pick_tx, the TX queue is read by doing a check with sk_tx_queue_recorded on the socket, followed by a sk_tx_queue_get. The problem is that there is not mutual exclusion across these calls in the socket so it it is possible that the queue in the sock can be invalidated after sk_tx_queue_recorded is called so that sk_tx_queue get returns -1, which sets 65535 in queue_index and thus dev_pick_tx returns 65536 which is a bogus queue and can cause crash in dev_queue_xmit. We fix this by only calling sk_tx_queue_get which does the proper checks. The interface is that sk_tx_queue_get returns the TX queue if the sock argument is non-NULL and TX queue is recorded, else it returns -1. sk_tx_queue_recorded is no longer used so it can be completely removed. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-14net/core: neighbour update OopsDoug Kehn1-1/+4
When configuring DMVPN (GRE + openNHRP) and a GRE remote address is configured a kernel Oops is observed. The obserseved Oops is caused by a NULL header_ops pointer (neigh->dev->header_ops) in neigh_update_hhs() when void (*update)(struct hh_cache*, const struct net_device*, const unsigned char *) = neigh->dev->header_ops->cache_update; is executed. The dev associated with the NULL header_ops is the GRE interface. This patch guards against the possibility that header_ops is NULL. This Oops was first observed in kernel version 2.6.26.8. Signed-off-by: Doug Kehn <rdkehn@yahoo.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-14net: skb_tx_hash() fix relative to skb_orphan_try()Eric Dumazet1-3/+10
commit fc6055a5ba31e2 (net: Introduce skb_orphan_try()) added early orphaning of skbs. This unfortunately added a performance regression in skb_tx_hash() in case of stacked devices (bonding, vlans, ...) Since skb->sk is now NULL, we cannot access sk->sk_hash anymore to spread tx packets to multiple NIC queues on multiqueue devices. skb_tx_hash() in this case only uses skb->protocol, same value for all flows. skb_orphan_try() can copy sk->sk_hash into skb->rxhash and skb_tx_hash() can use this saved sk_hash value to compute its internal hash value. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-12net/core: Remove unnecessary casts of private_dataJoe Perches1-2/+2
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-12net: sock_free() optimizationsEric Dumazet1-2/+3
Avoid two extra instructions in sock_free(), to reload skb->truesize and skb->sk Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-12net/core: EXPORT_SYMBOL cleanupsEric Dumazet13-53/+32
CodingStyle cleanups EXPORT_SYMBOL should immediately follow the symbol declaration. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-09net: Document that dev_get_stats() returns the given pointerBen Hutchings1-6/+6
Document that dev_get_stats() returns the same stats pointer it was given. Remove const qualification from the returned pointer since the caller may do what it likes with that structure. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-09net: Get rid of rtnl_link_stats64 / net_device_stats unionBen Hutchings1-5/+26
In commit be1f3c2c027cc5ad735df6a45a542ed1db7ec48b "net: Enable 64-bit net device statistics on 32-bit architectures" I redefined struct net_device_stats so that it could be used in a union with struct rtnl_link_stats64, avoiding the need for explicit copying or conversion between the two. However, this is unsafe because there is no locking required and no lock consistently held around calls to dev_get_stats() and use of the statistics structure it returns. In commit 28172739f0a276eb8d6ca917b3974c2edb036da3 "net: fix 64 bit counters on 32 bit arches" Eric Dumazet dealt with that problem by requiring callers of dev_get_stats() to provide storage for the result. This means that the net_device::stats64 field and the padding in struct net_device_stats are now redundant, so remove them. Update the comment on net_device_ops::ndo_get_stats64 to reflect its new usage. Change dev_txq_stats_fold() to use struct rtnl_link_stats64, since that is what all its callers are really using and it is no longer going to be compatible with struct net_device_stats. Eric Dumazet suggested the separate function for the structure conversion. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-07Merge branch 'master' of ↵David S. Miller2-11/+48
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
2010-07-07net: fix 64 bit counters on 32 bit archesEric Dumazet3-11/+21
There is a small possibility that a reader gets incorrect values on 32 bit arches. SNMP applications could catch incorrect counters when a 32bit high part is changed by another stats consumer/provider. One way to solve this is to add a rtnl_link_stats64 param to all ndo_get_stats64() methods, and also add such a parameter to dev_get_stats(). Rule is that we are not allowed to use dev->stats64 as a temporary storage for 64bit stats, but a caller provided area (usually on stack) Old drivers (only providing get_stats() method) need no changes. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-04netdevice.h net/core/dev.c: Convert netdev_<level> logging macros to functionsJoe Perches1-0/+62
Reduces an x86 defconfig text and data ~2k. text is smaller, data is larger. $ size vmlinux* text data bss dec hex filename 7198862 720112 1366288 9285262 8dae8e vmlinux 7205273 716016 1366288 9287577 8db799 vmlinux.device_h Uses %pV and struct va_format Format arguments are verified before printk Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-07-02net: decreasing real_num_tx_queues needs to flush qdiscJohn Fastabend1-0/+18
Reducing real_num_queues needs to flush the qdisc otherwise skbs with queue_mappings greater then real_num_tx_queues can be sent to the underlying driver. The flow for this is, dev_queue_xmit() dev_pick_tx() skb_tx_hash() => hash using real_num_tx_queues skb_set_queue_mapping() ... qdisc_enqueue_root() => enqueue skb on txq from hash ... dev->real_num_tx_queues -= n ... sch_direct_xmit() dev_hard_start_xmit() ndo_start_xmit(skb,dev) => skb queue set with old hash skbs are enqueued on the qdisc with skb->queue_mapping set 0 < queue_mappings < real_num_tx_queues. When the driver decreases real_num_tx_queues skb's may be dequeued from the qdisc with a queue_mapping greater then real_num_tx_queues. This fixes a case in ixgbe where this was occurring with DCB and FCoE. Because the driver is using queue_mapping to map skbs to tx descriptor rings we can potentially map skbs to rings that no longer exist. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Tested-by: Ross Brattain <ross.b.brattain@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-30ethtool: Add support for control of RX flow hash indirectionBen Hutchings1-0/+80
Many NICs use an indirection table to map an RX flow hash value to one of an arbitrary number of queues (not necessarily a power of 2). It can be useful to remove some queues from this indirection table so that they are only used for flows that are specifically filtered there. It may also be useful to weight the mapping to account for user processes with the same CPU-affinity as the RX interrupts. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-30ethtool: Change ethtool_op_set_flags to validate flagsBen Hutchings1-23/+5
ethtool_op_set_flags() does not check for unsupported flags, and has no way of doing so. This means it is not suitable for use as a default implementation of ethtool_ops::set_flags. Add a 'supported' parameter specifying the flags that the driver and hardware support, validate the requested flags against this, and change all current callers to pass this parameter. Change some other trivial implementations of ethtool_ops::set_flags to call ethtool_op_set_flags(). Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Reviewed-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-30net/core: use ntohs for skb->protocolSebastian Andrzej Siewior1-1/+2
This is only noticed by people that are not doing everything correct in the first place. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-29ethtool: Fix potential user buffer overflow for ETHTOOL_{G, S}RXFHBen Hutchings1-9/+27
struct ethtool_rxnfc was originally defined in 2.6.27 for the ETHTOOL_{G,S}RXFH command with only the cmd, flow_type and data fields. It was then extended in 2.6.30 to support various additional commands. These commands should have been defined to use a new structure, but it is too late to change that now. Since user-space may still be using the old structure definition for the ETHTOOL_{G,S}RXFH commands, and since they do not need the additional fields, only copy the originally defined fields to and from user-space. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Cc: stable@kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-29ethtool: Fix potential kernel buffer overflow in ETHTOOL_GRXCLSRLALLBen Hutchings1-2/+3
On a 32-bit machine, info.rule_cnt >= 0x40000000 leads to integer overflow and the buffer may be smaller than needed. Since ETHTOOL_GRXCLSRLALL is unprivileged, this can presumably be used for at least denial of service. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Cc: stable@kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-28net: use this_cpu_ptr()Eric Dumazet1-2/+2
use this_cpu_ptr(p) instead of per_cpu_ptr(p, smp_processor_id()) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25net/core/pktgen.c: Use pr_<level>Joe Perches1-78/+64
Add pr_fmt(fmt) KBUILD_MODNAME ": " fmt Remove "pktgen: " from formats Convert printks to pr_<level> Added func_enter() for debugging Moved version to end of string at module_init Coalesced long formats Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-25net: optimize Berkeley Packet Filter (BPF) processingHagen Paul Pfeifer1-51/+161
Gcc is currenlty not in the ability to optimize the switch statement in sk_run_filter() because of dense case labels. This patch replace the OR'd labels with ordered sequenced case labels. The sk_chk_filter() function is modified to patch/replace the original OPCODES in a ordered but equivalent form. gcc is now in the ability to transform the switch statement in sk_run_filter into a jump table of complexity O(1). Until this patch gcc generates a sequence of conditional branches (O(n) of 567 byte .text segment size (arch x86_64): 7ff: 8b 06 mov (%rsi),%eax 801: 66 83 f8 35 cmp $0x35,%ax 805: 0f 84 d0 02 00 00 je adb <sk_run_filter+0x31d> 80b: 0f 87 07 01 00 00 ja 918 <sk_run_filter+0x15a> 811: 66 83 f8 15 cmp $0x15,%ax 815: 0f 84 c5 02 00 00 je ae0 <sk_run_filter+0x322> 81b: 77 73 ja 890 <sk_run_filter+0xd2> 81d: 66 83 f8 04 cmp $0x4,%ax 821: 0f 84 17 02 00 00 je a3e <sk_run_filter+0x280> 827: 77 29 ja 852 <sk_run_filter+0x94> 829: 66 83 f8 01 cmp $0x1,%ax [...] With the modification the compiler translate the switch statement into the following jump table fragment: 7ff: 66 83 3e 2c cmpw $0x2c,(%rsi) 803: 0f 87 1f 02 00 00 ja a28 <sk_run_filter+0x26a> 809: 0f b7 06 movzwl (%rsi),%eax 80c: ff 24 c5 00 00 00 00 jmpq *0x0(,%rax,8) 813: 44 89 e3 mov %r12d,%ebx 816: e9 43 03 00 00 jmpq b5e <sk_run_filter+0x3a0> 81b: 41 89 dc mov %ebx,%r12d 81e: e9 3b 03 00 00 jmpq b5e <sk_run_filter+0x3a0> Furthermore, I reordered the instructions to reduce cache line misses by order the most common instruction to the start. Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-24net: fix "netpoll: Allow netpoll_setup/cleanup recursion"Andrew Morton1-1/+0
Remove rtnl_unlock() which had no corresponding rtnl_lock(). Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-23net: consolidate netif_needs_gso() checksJohn Fastabend1-36/+32
netif_needs_gso() is checked twice in the TX path once, before submitting the skb to the qdisc and once after it is dequeued from the qdisc just before calling ndo_hard_start(). This opens a window for a user to change the gso/tso or tx checksum settings that can cause netif_needs_gso to be true in one check and false in the other. Specifically, changing TX checksum setting may cause the warning in skb_gso_segment() to be triggered if the checksum is calculated earlier. This consolidates the netif_needs_gso() calls so that the stack only checks if gso is needed in dev_hard_start_xmit(). Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-16net: Export cred_to_ucred to modules.David S. Miller1-0/+1
AF_UNIX references this, and can be built as a module, so... Signed-off-by: David S. Miller <davem@davemloft.net>