Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"A lot of activities on cgroup side. Heavy restructuring including
locking simplification took place to improve the code base and enable
implementation of the unified hierarchy, which currently exists behind
a __DEVEL__ mount option. The core support is mostly complete but
individual controllers need further work. To explain the design and
rationales of the the unified hierarchy
Documentation/cgroups/unified-hierarchy.txt
is added.
Another notable change is css (cgroup_subsys_state - what each
controller uses to identify and interact with a cgroup) iteration
update. This is part of continuing updates on css object lifetime and
visibility. cgroup started with reference count draining on removal
way back and is now reaching a point where csses behave and are
iterated like normal refcnted objects albeit with some complexities to
allow distinguishing the state where they're being deleted. The css
iteration update isn't taken advantage of yet but is planned to be
used to simplify memcg significantly"
* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
cgroup: disallow disabled controllers on the default hierarchy
cgroup: don't destroy the default root
cgroup: disallow debug controller on the default hierarchy
cgroup: clean up MAINTAINERS entries
cgroup: implement css_tryget()
device_cgroup: use css_has_online_children() instead of has_children()
cgroup: convert cgroup_has_live_children() into css_has_online_children()
cgroup: use CSS_ONLINE instead of CGRP_DEAD
cgroup: iterate cgroup_subsys_states directly
cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
cgroup: move cgroup->serial_nr into cgroup_subsys_state
cgroup: link all cgroup_subsys_states in their sibling lists
cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
cgroup: remove cgroup->parent
device_cgroup: remove direct access to cgroup->children
memcg: update memcg_has_children() to use css_next_child()
memcg: remove tasks/children test from mem_cgroup_force_empty()
cgroup: remove css_parent()
cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
cgroup: use cgroup->self.refcnt for cgroup refcnting
...
|
|
Now that 3.15 is released, this merges the 'next' branch into 'master',
bringing us to the normal situation where my 'master' branch is the
merge window.
* accumulated work in next: (6809 commits)
ufs: sb mutex merge + mutex_destroy
powerpc: update comments for generic idle conversion
cris: update comments for generic idle conversion
idle: remove cpu_idle() forward declarations
nbd: zero from and len fields in NBD_CMD_DISCONNECT.
mm: convert some level-less printks to pr_*
MAINTAINERS: adi-buildroot-devel is moderated
MAINTAINERS: add linux-api for review of API/ABI changes
mm/kmemleak-test.c: use pr_fmt for logging
fs/dlm/debug_fs.c: replace seq_printf by seq_puts
fs/dlm/lockspace.c: convert simple_str to kstr
fs/dlm/config.c: convert simple_str to kstr
mm: mark remap_file_pages() syscall as deprecated
mm: memcontrol: remove unnecessary memcg argument from soft limit functions
mm: memcontrol: clean up memcg zoneinfo lookup
mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
mm/mempool.c: update the kmemleak stack trace for mempool allocations
lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
mm: introduce kmemleak_update_trace()
mm/kmemleak.c: use %u to print ->checksum
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
Pull core locking updates from Ingo Molnar:
"The main changes in this cycle were:
- reduced/streamlined smp_mb__*() interface that allows more usecases
and makes the existing ones less buggy, especially in rarer
architectures
- add rwsem implementation comments
- bump up lockdep limits"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
rwsem: Add comments to explain the meaning of the rwsem's count field
lockdep: Increase static allocations
arch: Mass conversion of smp_mb__*()
arch,doc: Convert smp_mb__*()
arch,xtensa: Convert smp_mb__*()
arch,x86: Convert smp_mb__*()
arch,tile: Convert smp_mb__*()
arch,sparc: Convert smp_mb__*()
arch,sh: Convert smp_mb__*()
arch,score: Convert smp_mb__*()
arch,s390: Convert smp_mb__*()
arch,powerpc: Convert smp_mb__*()
arch,parisc: Convert smp_mb__*()
arch,openrisc: Convert smp_mb__*()
arch,mn10300: Convert smp_mb__*()
arch,mips: Convert smp_mb__*()
arch,metag: Convert smp_mb__*()
arch,m68k: Convert smp_mb__*()
arch,m32r: Convert smp_mb__*()
arch,ia64: Convert smp_mb__*()
...
|
|
This bug is discovered by an recent F-RTO issue on tcpm list
https://www.ietf.org/mail-archive/web/tcpm/current/msg08794.html
The bug is that currently F-RTO does not use DSACK to undo cwnd in
certain cases: upon receiving an ACK after the RTO retransmission in
F-RTO, and the ACK has DSACK indicating the retransmission is spurious,
the sender only calls tcp_try_undo_loss() if some never retransmisted
data is sacked (FLAG_ORIG_DATA_SACKED).
The correct behavior is to unconditionally call tcp_try_undo_loss so
the DSACK information is used properly to undo the cwnd reduction.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
the value of itag is a random value from stack, and may not be initiated by
fib_validate_source, which called fib_combine_itag if CONFIG_IP_ROUTE_CLASSID
is not set
This will make the cached dst uncertainty
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We need to initialize the fallback device to have a correct mtu
set on this device. Otherwise the mtu is set to null and the device
is unusable.
Fixes: fd58156e456d ("IPIP: Use ip-tunneling code.")
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The connected check fails to check for ip_gre nbma mode tunnels
properly. ip_gre creates temporary tnl_params with daddr specified
to pass-in the actual target on per-packet basis from neighbor
layer. Detect these tunnels by inspecting the actual tunnel
configuration.
Minimal test case:
ip route add 192.168.1.1/32 via 10.0.0.1
ip route add 192.168.1.2/32 via 10.0.0.2
ip tunnel add nbma0 mode gre key 1 tos c0
ip addr add 172.17.0.0/16 dev nbma0
ip link set nbma0 up
ip neigh add 172.17.0.1 lladdr 192.168.1.1 dev nbma0
ip neigh add 172.17.0.2 lladdr 192.168.1.2 dev nbma0
ping 172.17.0.1
ping 172.17.0.2
The second ping should be going to 192.168.1.2 and head 10.0.0.2;
but cached gre tunnel level route is used and it's actually going
to 192.168.1.1 via 10.0.0.1.
The lladdr's need to go to separate dst for the bug to trigger.
Test case uses separate route entries, but this can also happen
when the route entry is same: if there is a nexthop exception or
the GRE tunnel is IPsec'ed in which case the dst points to xfrm
bundle unique to the gre lladdr.
Fixes: 7d442fab0a67 ("ipv4: Cache dst in tunnels")
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Cc: Tom Herbert <therbert@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Conflicts:
net/ipv4/ip_vti.c
Steffen Klassert says:
====================
pull request (net): ipsec 2014-05-15
This pull request has a merge conflict in net/ipv4/ip_vti.c
between commit 8d89dcdf80d8 ("vti: don't allow to add the same
tunnel twice") and commit a32452366b72 ("vti4:Don't count header
length twice"). It can be solved like it is done in linux-next.
1) Fix a ipv6 xfrm output crash when a packet is rerouted
by netfilter to not use IPsec.
2) vti4 counts some header lengths twice leading to an incorrect
device mtu. Fix this by counting these headers only once.
3) We don't catch the case if an unsupported protocol is submitted
to the xfrm protocol handlers, this can lead to NULL pointer
dereferences. Fix this by adding the appropriate checks.
4) vti6 may unregister pernet ops twice on init errors.
Fix this by removing one of the calls to do it only once.
From Mathias Krause.
5) Set the vti tunnel mark before doing a lookup in the error
handlers. Otherwise we don't find the correct xfrm state.
====================
The conflict in ip_vti.c was simple, 'net' had a commit
removing a line from vti_tunnel_init() and this tree
being merged had a commit adding a line to the same
location.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
cftype->trigger() is pointless. It's trivial to ignore the input
buffer from a regular ->write() operation. Convert all ->trigger()
users to ->write() and remove ->trigger().
This patch doesn't introduce any visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
|
|
Convert all cftype->write_string() users to the new cftype->write()
which maps directly to kernfs write operation and has full access to
kernfs and cgroup contexts. The conversions are mostly mechanical.
* @css and @cft are accessed using of_css() and of_cft() accessors
respectively instead of being specified as arguments.
* Should return @nbytes on success instead of 0.
* @buf is not trimmed automatically. Trim if necessary. Note that
blkcg and netprio don't need this as the parsers already handle
whitespaces.
cftype->write_string() has no user left after the conversions and
removed.
While at it, remove unnecessary local variable @p in
cgroup_subtree_control_write() and stale comment about
CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.
This patch doesn't introduce any visible behavior changes.
v2: netprio was missing from conversion. Converted.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
|
|
We need to use the mark we get from the tunnels o_key to
lookup the right vti state in the error handlers. This patch
ensures that.
Fixes: df3893c1 ("vti: Update the ipv4 side to use it's own receive hook.")
Fixes: fa9ad96d ("vti6: Update the ipv6 side to use its own receive hook.")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following batch contains netfilter fixes for your net tree, they are:
1) Fix use after free in nfnetlink when sending a batch for some
unsupported subsystem, from Denys Fedoryshchenko.
2) Skip autoload of the nat module if no binding is specified via
ctnetlink, from Florian Westphal.
3) Set local_df after netfilter defragmentation to avoid a bogus ICMP
fragmentation needed in the forwarding path, also from Florian.
4) Fix potential user after free in ip6_route_me_harder() when returning
the error code to the upper layers, from Sergey Popovich.
5) Skip possible bogus ICMP time exceeded emitted from the router (not
valid according to RFC) if conntrack zones are used, from Vasily Averin.
6) Fix fragment handling when nf_defrag_ipv4 is loaded but nf_conntrack
is not present, also from Vasily.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Similarly, when CONFIG_SYSCTL is not set, ping_group_range should still
work, just that no one can change it. Therefore we should move it out of
sysctl_net_ipv4.c. And, it should not share the same seqlock with
ip_local_port_range.
BTW, rename it to ->ping_group_range instead.
Cc: David S. Miller <davem@davemloft.net>
Cc: Francois Romieu <romieu@fr.zoreil.com>
Reported-by: Stefan de Konink <stefan@konink.de>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When CONFIG_SYSCTL is not set, ip_local_port_range should still work,
just that no one can change it. Therefore we should move it out of sysctl_inet.c.
Also, rename it to ->ip_local_ports instead.
Cc: David S. Miller <davem@davemloft.net>
Cc: Francois Romieu <romieu@fr.zoreil.com>
Reported-by: Stefan de Konink <stefan@konink.de>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Increment fib_info_cnt in fib_create_info() right after successfuly
alllocating fib_info structure, overwise fib_metrics allocation failure
leads to fib_info_cnt incorrectly decremented in free_fib_info(), called
on error path from fib_create_info().
Signed-off-by: Sergey Popovich <popovich_sergei@mail.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Doing the segmentation in the forward path has one major drawback:
When using virtio, we may process gso udp packets coming
from host network stack. In that case, netfilter POSTROUTING
will see one packet with udp header followed by multiple ip
fragments.
Delay the segmentation and do it after POSTROUTING invocation
to avoid this.
Fixes: fe6cc55f3a9 ("net: ip, ipv6: handle gso skbs in forwarding path")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
local_df means 'ignore DF bit if set', so if its set we're
allowed to perform ip fragmentation.
This wasn't noticed earlier because the output path also drops such skbs
(and emits needed icmp error) and because netfilter ip defrag did not
set local_df until couple of days ago.
Only difference is that DF-packets-larger-than MTU now discarded
earlier (f.e. we avoid pointless netfilter postrouting trip).
While at it, drop the repeated test ip_exceeds_mtu, checking it once
is enough...
Fixes: fe6cc55f3a9 ("net: ip, ipv6: handle gso skbs in forwarding path")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In ip_tunnel_rcv(), set skb->network_header to inner IP header
before IP_ECN_decapsulate().
Without the fix, IP_ECN_decapsulate() takes outer IP header as
inner IP header, possibly causing error messages or packet drops.
Note that this skb_reset_network_header() call was in this spot when
the original feature for checking consistency of ECN bits through
tunnels was added in eccc1bb8d4b4 ("tunnel: drop packet if ECN present
with not-ECT"). It was only removed from this spot in 3d7b46cd20e3
("ip_tunnel: push generic protocol handling to ip_tunnel module.").
Fixes: 3d7b46cd20e3 ("ip_tunnel: push generic protocol handling to ip_tunnel module.")
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Ying Cai <ycai@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Defrag user check in ip_expire was not updated after adding support for
"conntrack zones".
This bug manifests as a RFC violation, since the router will send
the icmp time exceeeded message when using conntrack zones.
Signed-off-by: Vasily Averin <vvs@openvz.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
else we may fail to forward skb even if original fragments do fit
outgoing link mtu:
1. remote sends 2k packets in two 1000 byte frags, DF set
2. we want to forward but only see '2k > mtu and DF set'
3. we then send icmp error saying that outgoing link is 1500
But original sender never sent a packet that would not fit
the outgoing link.
Setting local_df makes outgoing path test size vs.
IPCB(skb)->frag_max_size, so we will still send the correct
error in case the largest original size did not fit
outgoing link mtu.
Reported-by: Maxime Bizon <mbizon@freebox.fr>
Suggested-by: Maxime Bizon <mbizon@freebox.fr>
Fixes: 5f2d04f1f9 (ipv4: fix path MTU discovery with connection tracking)
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
commit b9f47a3aaeab (tcp_cubic: limit delayed_ack ratio to prevent
divide error) try to prevent divide error, but there is still a little
chance that delayed_ack can reach zero. In case the param cnt get
negative value, then ratio+cnt would overflow and may happen to be zero.
As a result, min(ratio, ACK_RATIO_LIMIT) will calculate to be zero.
In some old kernels, such as 2.6.32, there is a bug that would
pass negative param, which then ultimately leads to this divide error.
commit 5b35e1e6e9c (tcp: fix tcp_trim_head() to adjust segment count
with skb MSS) fixed the negative param issue. However,
it's safe that we fix the range of delayed_ack as well,
to make sure we do not hit a divide by zero.
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Liu Yu <allanyuliu@tencent.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Both TLP and Fast Open call __tcp_retransmit_skb() instead of
tcp_retransmit_skb() to avoid changing tp->retrans_out.
This has the side effect of missing SNMP counters increments as well
as tcp_info tcpi_total_retrans updates.
Fix this by moving the stats increments of into __tcp_retransmit_skb()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We don't catch the case if an unsupported protocol is submitted
to the xfrm4 protocol handlers, this can lead to NULL pointer
dereferences. Fix this by adding the appropriate checks.
Fixes: 3328715e ("xfrm4: Add IPsec protocol multiplexer")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Mostly scripted conversion of the smp_mb__* barriers.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Because the netdevice may be in another netns than the i/o netns, we should
use the i/o netns instead of dev_net(dev).
The variable 'tunnel' was used only to get 'itn', hence to simplify code I
remove it and use 't' instead.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In my special case, when a packet is redirected from veth0 to lo,
its skb->dev->ifindex would be LOOPBACK_IFINDEX. Meanwhile we
pass the hard-coded LOOPBACK_IFINDEX to fib_validate_source()
in ip_route_input_slow(). This would cause the following check
in fib_validate_source() fail:
(dev->ifindex != oif || !IN_DEV_TX_REDIRECTS(idev))
when rp_filter is disabeld on loopback. As suggested by Julian,
the caller should pass 0 here so that we will not end up by
calling __fib_validate_source().
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As suggested by Julian:
Simply, flowi4_iif must not contain 0, it does not
look logical to ignore all ip rules with specified iif.
because in fib_rule_match() we do:
if (rule->iifindex && (rule->iifindex != fl->flowi_iif))
goto out;
flowi4_iif should be LOOPBACK_IFINDEX by default.
We need to move LOOPBACK_IFINDEX to include/net/flow.h:
1) It is mostly used by flowi_iif
2) Fix the following compile error if we use it in flow.h
by the patches latter:
In file included from include/linux/netfilter.h:277:0,
from include/net/netns/netfilter.h:5,
from include/net/net_namespace.h:21,
from include/linux/netdevice.h:43,
from include/linux/icmpv6.h:12,
from include/linux/ipv6.h:61,
from include/net/ipv6.h:16,
from include/linux/sunrpc/clnt.h:27,
from include/linux/nfs_fs.h:30,
from init/do_mounts.c:32:
include/net/flow.h: In function ‘flowi4_init_output’:
include/net/flow.h:84:32: error: ‘LOOPBACK_IFINDEX’ undeclared (first use in this function)
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We currently count the size of LL_MAX_HEADER and struct iphdr
twice for vti4 devices, this leads to a wrong device mtu.
The size of LL_MAX_HEADER and struct iphdr is already counted in
ip_tunnel_bind_dev(), so don't do it again in vti_tunnel_init().
Fixes: b9959fd3 ("vti: switch to new ip tunnel code")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
In the dst->output() path for ipv4, the code assumes the skb it has to
transmit is attached to an inet socket, specifically via
ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the
provider of the packet is an AF_PACKET socket.
The dst->output() method gets an additional 'struct sock *sk'
parameter. This needs a cascade of changes so that this parameter can
be propagated from vxlan to final consumer.
Fixes: 8f646c922d55 ("vxlan: keep original skb ownership")
Reported-by: lucien xin <lucien.xin@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
ip_queue_xmit() assumes the skb it has to transmit is attached to an
inet socket. Commit 31c70d5956fc ("l2tp: keep original skb ownership")
changed l2tp to not change skb ownership and thus broke this assumption.
One fix is to add a new 'struct sock *sk' parameter to ip_queue_xmit(),
so that we do not assume skb->sk points to the socket used by l2tp
tunnel.
Fixes: 31c70d5956fc ("l2tp: keep original skb ownership")
Reported-by: Zhan Jianyu <nasa4836@gmail.com>
Tested-by: Zhan Jianyu <nasa4836@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Extend commit 13378cad02afc2adc6c0e07fca03903c7ada0b37
("ipv4: Change rt->rt_iif encoding.") from 3.6 to return valid
RTA_IIF on 'ip route get ... iif DEVICE' instead of rt_iif 0
which is displayed as 'iif *'.
inet_iif is not appropriate to use because skb_iif is not set.
Use the skb->dev->ifindex instead.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Plug a group_info refcount leak in ping_init.
group_info is only needed during initialization and
the code failed to release the reference on exit.
While here move grabbing the reference to a place
where it is actually needed.
Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Zhang Dongxing <dongxing.zhang@intel.com>
Signed-off-by: xiaoming wang <xiaoming.wang@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Before the patch, it was possible to add two times the same tunnel:
ip l a vti1 type vti remote 10.16.0.121 local 10.16.0.249 key 41
ip l a vti2 type vti remote 10.16.0.121 local 10.16.0.249 key 41
It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the
argument dev->type, which was set only later (when calling ndo_init handler
in register_netdevice()). Let's set this type in the setup handler, which is
called before newlink handler.
Introduced by commit b9959fd3b0fa ("vti: switch to new ip tunnel code").
CC: Cong Wang <amwang@redhat.com>
CC: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Before the patch, it was possible to add two times the same tunnel:
ip l a gre1 type gre remote 10.16.0.121 local 10.16.0.249
ip l a gre2 type gre remote 10.16.0.121 local 10.16.0.249
It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the
argument dev->type, which was set only later (when calling ndo_init handler
in register_netdevice()). Let's set this type in the setup handler, which is
called before newlink handler.
Introduced by commit c54419321455 ("GRE: Refactor GRE tunneling code.").
CC: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Several spots in the kernel perform a sequence like:
skb_queue_tail(&sk->s_receive_queue, skb);
sk->sk_data_ready(sk, skb->len);
But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up. So this skb->len access is potentially
to freed up memory.
Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.
And finally, no actual implementation of this callback actually uses
the length argument. And since nobody actually cared about it's
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.
So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.
Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pull more networking updates from David Miller:
1) If a VXLAN interface is created with no groups, we can crash on
reception of packets. Fix from Mike Rapoport.
2) Missing includes in CPTS driver, from Alexei Starovoitov.
3) Fix string validations in isdnloop driver, from YOSHIFUJI Hideaki
and Dan Carpenter.
4) Missing irq.h include in bnxw2x, enic, and qlcnic drivers. From
Josh Boyer.
5) AF_PACKET transmit doesn't statistically count TX drops, from Daniel
Borkmann.
6) Byte-Queue-Limit enabled drivers aren't handled properly in
AF_PACKET transmit path, also from Daniel Borkmann.
Same problem exists in pktgen, and Daniel fixed it there too.
7) Fix resource leaks in driver probe error paths of new sxgbe driver,
from Francois Romieu.
8) Truesize of SKBs can gradually get more and more corrupted in NAPI
packet recycling path, fix from Eric Dumazet.
9) Fix uniprocessor netfilter build, from Florian Westphal. In the
longer term we should perhaps try to find a way for ARRAY_SIZE() to
work even with zero sized array elements.
10) Fix crash in netfilter conntrack extensions due to mis-estimation of
required extension space. From Andrey Vagin.
11) Since we commit table rule updates before trying to copy the
counters back to userspace (it's the last action we perform), we
really can't signal the user copy with an error as we are beyond the
point from which we can unwind everything. This causes all kinds of
use after free crashes and other mysterious behavior.
From Thomas Graf.
12) Restore previous behvaior of div/mod by zero in BPF filter
processing. From Daniel Borkmann.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (38 commits)
net: sctp: wake up all assocs if sndbuf policy is per socket
isdnloop: several buffer overflows
netdev: remove potentially harmful checks
pktgen: fix xmit test for BQL enabled devices
net/at91_ether: avoid NULL pointer dereference
tipc: Let tipc_release() return 0
at86rf230: fix MAX_CSMA_RETRIES parameter
mac802154: fix duplicate #include headers
sxgbe: fix duplicate #include headers
net: filter: be more defensive on div/mod by X==0
netfilter: Can't fail and free after table replacement
xen-netback: Trivial format string fix
net: bcmgenet: Remove unnecessary version.h inclusion
net: smc911x: Remove unused local variable
bonding: Inactive slaves should keep inactive flag's value
netfilter: nf_tables: fix wrong format in request_module()
netfilter: nf_tables: set names cannot be larger than 15 bytes
netfilter: nf_conntrack: reserve two bytes for nf_ct_ext->len
netfilter: Add {ipt,ip6t}_osf aliases for xt_osf
netfilter: x_tables: allow to use cgroup match for LOCAL_IN nf hooks
...
|
|
The RT_CACHE_STAT_INC macro triggers the new preemption checks
for __this_cpu ops.
I do not see any other synchronization that would allow the use of a
__this_cpu operation here however in commit dbd2915ce87e ("[IPV4]:
RT_CACHE_STAT_INC() warning fix") Andrew justifies the use of
raw_smp_processor_id() here because "we do not care" about races. In
the past we agreed that the price of disabling interrupts here to get
consistent counters would be too high. These counters may be inaccurate
due to race conditions.
The use of __this_cpu op improves the situation already from what commit
dbd2915ce87e did since the single instruction emitted on x86 does not
allow the race to occur anymore. However, non x86 platforms could still
experience a race here.
Trace:
__this_cpu_add operation in preemptible [00000000] code: avahi-daemon/1193
caller is __this_cpu_preempt_check+0x38/0x60
CPU: 1 PID: 1193 Comm: avahi-daemon Tainted: GF 3.12.0-rc4+ #187
Call Trace:
check_preemption_disabled+0xec/0x110
__this_cpu_preempt_check+0x38/0x60
__ip_route_output_key+0x575/0x8c0
ip_route_output_flow+0x27/0x70
udp_sendmsg+0x825/0xa20
inet_sendmsg+0x85/0xc0
sock_sendmsg+0x9c/0xd0
___sys_sendmsg+0x37c/0x390
__sys_sendmsg+0x49/0x90
SyS_sendmsg+0x12/0x20
tracesys+0xe1/0xe6
Signed-off-by: Christoph Lameter <cl@linux.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The ipv6 xfrm output path is not aware that packets can be
rerouted by NAT to not use IPsec. We crash in this case
because we expect to have a xfrm state at the dst_entry.
This crash happens if the ipv6 layer does IPsec and NAT
or if we have an interfamily IPsec tunnel with ipv4 NAT.
We fix this by checking for a NAT rerouted packet in each
address family and dst_output() to the new destination
in this case.
Reported-by: Martin Pelikan <martin.pelikan@gmail.com>
Tested-by: Martin Pelikan <martin.pelikan@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
All xtables variants suffer from the defect that the copy_to_user()
to copy the counters to user memory may fail after the table has
already been exchanged and thus exposed. Return an error at this
point will result in freeing the already exposed table. Any
subsequent packet processing will result in a kernel panic.
We can't copy the counters before exposing the new tables as we
want provide the counter state after the old table has been
unhooked. Therefore convert this into a silent error.
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"A lot updates for cgroup:
- The biggest one is cgroup's conversion to kernfs. cgroup took
after the long abandoned vfs-entangled sysfs implementation and
made it even more convoluted over time. cgroup's internal objects
were fused with vfs objects which also brought in vfs locking and
object lifetime rules. Naturally, there are places where vfs rules
don't fit and nasty hacks, such as credential switching or lock
dance interleaving inode mutex and cgroup_mutex with object serial
number comparison thrown in to decide whether the operation is
actually necessary, needed to be employed.
After conversion to kernfs, internal object lifetime and locking
rules are mostly isolated from vfs interactions allowing shedding
of several nasty hacks and overall simplification. This will also
allow implmentation of operations which may affect multiple cgroups
which weren't possible before as it would have required nesting
i_mutexes.
- Various simplifications including dropping of module support,
easier cgroup name/path handling, simplified cgroup file type
handling and task_cg_lists optimization.
- Prepatory changes for the planned unified hierarchy, which is still
a patchset away from being actually operational. The dummy
hierarchy is updated to serve as the default unified hierarchy.
Controllers which aren't claimed by other hierarchies are
associated with it, which BTW was what the dummy hierarchy was for
anyway.
- Various fixes from Li and others. This pull request includes some
patches to add missing slab.h to various subsystems. This was
triggered xattr.h include removal from cgroup.h. cgroup.h
indirectly got included a lot of files which brought in xattr.h
which brought in slab.h.
There are several merge commits - one to pull in kernfs updates
necessary for converting cgroup (already in upstream through
driver-core), others for interfering changes in the fixes branch"
* 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
cgroup: remove useless argument from cgroup_exit()
cgroup: fix spurious lockdep warning in cgroup_exit()
cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
cgroup: break kernfs active_ref protection in cgroup directory operations
cgroup: fix cgroup_taskset walking order
cgroup: implement CFTYPE_ONLY_ON_DFL
cgroup: make cgrp_dfl_root mountable
cgroup: drop const from @buffer of cftype->write_string()
cgroup: rename cgroup_dummy_root and related names
cgroup: move ->subsys_mask from cgroupfs_root to cgroup
cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
cgroup: reorganize cgroup bootstrapping
cgroup: relocate setting of CGRP_DEAD
cpuset: use rcu_read_lock() to protect task_cs()
cgroup_freezer: document freezer_fork() subtleties
cgroup: update cgroup_transfer_tasks() to either succeed or fail
cgroup: drop task_lock() protection around task->cgroups
cgroup: update how a newly forked task gets associated with css_set
...
|
|
Conflicts:
drivers/net/ethernet/marvell/mvneta.c
The mvneta.c conflict is a case of overlapping changes,
a conversion to devm_ioremap_resource() vs. a conversion
to netdev_alloc_pcpu_stats.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It seems I missed one change in get_timewait4_sock() to compute
the remaining time before deletion of IPV4 timewait socket.
This could result in wrong output in /proc/net/tcp for tm->when field.
Fixes: 96f817fedec4 ("tcp: shrink tcp6_timewait_sock by one cache line")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There is no need to allocate 15 bytes in excess for a SYNACK packet,
as it contains no data, only headers.
SYNACK are always generated in softirq context, and contain a single
segment, we can use TCP_INC_STATS_BH()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After commit d4589926d7a9 (tcp: refine TSO splits), tcp_nagle_check() does
not use parameter mss_now anymore.
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 10ddceb22ba (ip_tunnel:multicast process cause panic due
to skb->_skb_refdst NULL pointer) removed dst-drop call from
ip-tunnel-recv.
Following commit reintroduce dst-drop and fix the original bug by
checking loopback packet before releasing dst.
Original bug: https://bugzilla.kernel.org/show_bug.cgi?id=70681
CC: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conflicts:
Documentation/devicetree/bindings/net/micrel-ks8851.txt
net/core/netpoll.c
The net/core/netpoll.c conflict is a bug fix in 'net' happening
to code which is completely removed in 'net-next'.
In micrel-ks8851.txt we simply have overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
ip_rt_dump do nothing after IPv4 route caches removal, so we can remove it.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
ipv4_ifdown_dst does nothing after IPv4 route caches removal,
so we can remove it.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 8cd3ac9f9b7b ("ipmr: advertise new mfc entries via rtnl") reuses the
function ipmr_fill_mroute() to notify mfc events.
But this function was used only for dump and thus was always setting the
flag NLM_F_MULTI, which is wrong in case of a single notification.
Libraries like libnl will wait forever for NLMSG_DONE.
CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|