linux - Linux Kernel (branches are rebased on master from time to time)

Age	Commit message (Collapse)	Author	Files	Lines
2017-11-13	afs: Overhaul permit caching	David Howells	9	-187/+244
	Overhaul permit caching in AFS by making it per-vnode and sharing permit lists where possible. When most of the fileserver operations are called, they return a status structure indicating the (revised) details of the vnode or vnodes involved in the operation. This includes the access mark derived from the ACL (named CallerAccess in the protocol definition file). This is cacheable and if the ACL changes, the server will tell us that it is breaking the callback promise, at which point we can discard the currently cached permits. With this patch, the afs_permits structure has, at the end, an array of { key, CallerAccess } elements, sorted by key pointer. This is then cached in a hash table so that it can be shared between vnodes with the same access permits. Permit lists can only be shared if they contain the exact same set of key->CallerAccess mappings. Note that that table is global rather than being per-net_ns. If the keys in a permit list cross net_ns boundaries, there is no problem sharing the cached permits, since the permits are just integer masks. Since permit lists pin keys, the permit cache also makes it easier for a future patch to find all occurrences of a key and remove them by means of setting the afs_permits::invalidated flag and then clearing the appropriate key pointer. In such an event, memory barriers will need adding. Lastly, the permit caching is skipped if the server has sent either a vnode-specific or an entire-server callback since the start of the operation. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13	afs: Overhaul the callback handling	David Howells	14	-794/+443
	Overhaul the AFS callback handling by the following means: (1) Don't give up callback promises on vnodes that we are no longer using, rather let them just expire on the server or let the server break them. This is actually more efficient for the server as the callback lookup is expensive if there are lots of extant callbacks. (2) Only give up the callback promises we have from a server when the server record is destroyed. Then we can just give up all the callback promises on it in one go. (3) Servers can end up being shared between cells if cells are aliased, so don't add all the vnodes being backed by a particular server into a big FID-indexed tree on that server as there may be duplicates. Instead have each volume instance (~= superblock) register an interest in a server as it starts to make use of it and use this to allow the processor for callbacks from the server to find the superblock and thence the inode corresponding to the FID being broken by means of ilookup_nowait(). (4) Rather than iterating over the entire callback list when a mass-break comes in from the server, maintain a counter of mass-breaks in afs_server (cb_seq) and make afs_validate() check it against the copy in afs_vnode. It would be nice not to have to take a read_lock whilst doing this, but that's tricky without using RCU. (5) Save a ref on the fileserver we're using for a call in the afs_call struct so that we can access its cb_s_break during call decoding. (6) Write-lock around callback and status storage in a vnode and read-lock around getattr so that we don't see the status mid-update. This has the following consequences: (1) Data invalidation isn't seen until someone calls afs_validate() on a vnode. Unfortunately, we need to use a key to query the server, but getting one from a background thread is tricky without caching loads of keys all over the place. (2) Mass invalidation isn't seen until someone calls afs_validate(). (3) Callback breaking is going to hit the inode_hash_lock quite a bit. Could this be replaced with rcu_read_lock() since inodes are destroyed under RCU conditions. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13	afs: Rename struct afs_call server member to cm_server	David Howells	3	-11/+10
	Rename the server member of struct afs_call to cm_server as we're only going to be using it for incoming calls for the Cache Manager service. This makes it easier to differentiate from the pointer to the target server for the client, which will point to a different structure to allow for callback handling. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13	afs: Fix the afs_uuid struct to make the char-sized fields signed	David Howells	1	-3/+3
	In AFS's encoding of a UUID, the eight 'char' fields are all signed, so represent them with __s8 rather than __u8. This makes the compiler sign-extend them correctly when XDR-encoding them. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13	afs: Connect up the CB.ProbeUuid	David Howells	1	-0/+3
	The handler for the CB.ProbeUuid operation in the cache manager is implemented, but isn't listed in the switch-statement of operation selection, so won't be used. Fix this by adding it. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13	afs: Potentially return call->reply[0] from afs_make_call()	David Howells	2	-11/+18
	If call->ret_reply0 is set, return call->reply[0] on success. Change the return type of afs_make_call() to long so that this can be passed back without bit loss and then cast to a pointer if required. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13	afs: Condense afs_call's reply{,2,3,4} into an array	David Howells	5	-77/+74
	Condense struct afs_call's reply anchor members - reply{,2,3,4} - into an array. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13	afs: Consolidate abort_to_error translators	David Howells	6	-77/+36
	The AFS abort code space is shared across all services, so there's no need for separate abort_to_error translators for each service. Consolidate them into a single function and remove the function pointers for them. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13	afs: Allow IPv6 address specification of VL servers	David Howells	5	-26/+36
	Allow VL server specifications to be given IPv6 addresses as well as IPv4 addresses, for example as: echo add foo.org 1111:2222:3333:0:4444:5555:6666:7777 >/proc/fs/afs/cells Note that ':' is the expected separator for separating IPv4 addresses, but if a ',' is detected or no '.' is detected in the string, the delimiter is switched to ','. This also works with DNS AFSDB or SRV record strings fetched by upcall from userspace. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13	afs: Keep and pass sockaddr_rxrpc addresses rather than in_addr	David Howells	10	-130/+85
	Keep and pass sockaddr_rxrpc addresses around rather than keeping and passing in_addr addresses to allow for the use of IPv6 and non-standard port numbers in future. This also allows the port and service_id fields to be removed from the afs_call struct. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13	afs: Update the cache index structure	David Howells	4	-251/+50
	Update the cache index structure in the following ways: (1) Don't use the volume name followed by the volume type as levels in the cache index. Volumes can be renamed. Use the volume ID instead. (2) Don't store the VLDB data for a volume in the tree. If the volume database should be cached locally, then it should be done in a separate tree. (3) Expand the volume ID stored in the cache to 64 bits. (4) Expand the file/vnode ID stored in the cache to 96 bits. (5) Increment the cache structure version number to 1. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13	afs: Add some protocol defs	David Howells	3	-9/+35
	Add some protocol definitions, including max field lengths, flag defs, an XDR-encoded UUID def, more VL operation IDs and more fileserver abort codes. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13	afs: Push the net ns pointer to more places	David Howells	11	-56/+56
	Push the network namespace pointer to more places in AFS, including the afs_server structure (which doesn't hold a ref on the netns). In particular, afs_put_cell() now takes requires a net ns parameter so that it can safely alter the netns after decrementing the cell usage count - the cell will be deallocated by a background thread after being cached for a period, which means that it's not safe to access it after reducing its usage count. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13	afs: Note the cell in the superblock info also	David Howells	2	-24/+48
	Keep a reference to the cell in the superblock info structure in addition to the volume and net pointers. This will make it easier to clean up in a future patch in which afs_put_volume() will need the cell pointer. Whilst we're at it, make the cell and volume getting functions return a pointer to the object got to make the call sites look neater. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13	afs: Fix server reaping	David Howells	3	-10/+57
	Fix server reaping and make sure it's all done before we start trying to purge cells, given that servers currently pin cells. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13	afs: Close the rxrpc socket only after purging the servers	David Howells	1	-1/+1
	Close the rxrpc socket only after we've purged the server records (and also cell and volume records which might refer to servers) so that we can give up the callbacks on each server. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13	afs: Lay the groundwork for supporting network namespaces	David Howells	16	-492/+603
	Lay the groundwork for supporting network namespaces (netns) to the AFS filesystem by moving various global features to a network-namespace struct (afs_net) and providing an instance of this as a temporary global variable that everything uses via accessor functions for the moment. The following changes have been made: (1) Store the netns in the superblock info. This will be obtained from the mounter's nsproxy on a manual mount and inherited from the parent superblock on an automount. (2) The cell list is made per-netns. It can be viewed through /proc/net/afs/cells and also be modified by writing commands to that file. (3) The local workstation cell is set per-ns in /proc/net/afs/rootcell. This is unset by default. (4) The 'rootcell' module parameter, which sets a cell and VL server list modifies the init net namespace, thereby allowing an AFS root fs to be theoretically used. (5) The volume location lists and the file lock manager are made per-netns. (6) The AF_RXRPC socket and associated I/O bits are made per-ns. The various workqueues remain global for the moment. Changes still to be made: (1) /proc/fs/afs/ should be moved to /proc/net/afs/ and a symlink emplaced from the old name. (2) A per-netns subsys needs to be registered for AFS into which it can store its per-netns data. (3) Rather than the AF_RXRPC socket being opened on module init, it needs to be opened on the creation of a superblock in that netns. (4) The socket needs to be closed when the last superblock using it is destroyed and all outstanding client calls on it have been completed. This prevents a reference loop on the namespace. (5) It is possible that several namespaces will want to use AFS, in which case each one will need its own UDP port. These can either be set through /proc/net/afs/cm_port or the kernel can pick one at random. The init_ns gets 7001 by default. Other issues that need resolving: (1) The DNS keyring needs net-namespacing. (2) Where do upcalls go (eg. DNS request-key upcall)? (3) Need something like open_socket_in_file_ns() syscall so that AFS command line tools attempting to operate on an AFS file/volume have their RPC calls go to the right place. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13	Pass mode to wait_on_atomic_t() action funcs and provide default actions	David Howells	14	-98/+37
	Make wait_on_atomic_t() pass the TASK_* mode onto its action function as an extra argument and make it 'unsigned int throughout. Also, consolidate a bunch of identical action functions into a default function that can do the appropriate thing for the mode. Also, change the argument name in the bit_wait*() function declarations to reflect the fact that it's the mode and not the bit number. [Peter Z gives this a grudging ACK, but thinks that the whole atomic_t wait should be done differently, though he's not immediately sure as to how] Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> cc: Ingo Molnar <mingo@kernel.org>
2017-11-13	Merge remote-tracking branch 'tip/timers/core' into afs-next	David Howells	251	-1860/+1810
	These AFS patches need the timer_reduce() patch from timers/core. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13	Merge branch 'net-improve-the-process-of-redirect-and-toobig-for-ipv6-tunnels'	David S. Miller	2	-50/+34
	Xin Long says: ==================== net: improve the process of redirect and toobig for ipv6 tunnels Now let's say there are 3 kinds of icmp packets to process for tunnels, toobig(needfrag), redirect, others, their process should be: - toobig(needfrag) update the lower dst's pmtu by route cache, also update sk dst's pmtu if possible, or it will be fine if sk dst pmtu will get updated on tx path. - redirect update the lower dst's gw by route cache and return, no need to send this redirect packet to user sk. - others send the packet to user's sk, or it will also be fine to use err_count to count it and report fail link on tx path. All ipv4 tunnels basically follow this while some of ipv6 tunnels are doing in different ways, like ip6gre and ip6_tunnels update tnl dev's mtu instead of updating lower dst pmtu, no redirect process on their err_handlers, which doesn't make any sense and even causes performance problems. This patchset is to improve the process of redirect and toobig for ip6gre ip4ip6, ip6ip6 tunnels, as in ipv4 tunnels. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	ip6_tunnel: clean up ip4ip6 and ip6ip6's err_handlers	Xin Long	1	-28/+14
	This patch is to remove some useless codes of redirect and fix some indents on ip4ip6 and ip6ip6's err_handlers. Note that redirect icmp packet is already processed in ip6_tnl_err, the old redirect codes in ip4ip6_err actually never worked even before this patch. Besides, there's no need to send redirect to user's sk, it's for lower dst, so just remove it in this patch. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	ip6_tunnel: process toobig in a better way	Xin Long	1	-4/+3
	The same improvement in "ip6_gre: process toobig in a better way" is needed by ip4ip6 and ip6ip6 as well. Note that ip4ip6 and ip6ip6 will also update sk dst pmtu in their err_handlers. Like I said before, gre6 could not do this as it's inner proto is not certain. But for all of them, sk dst pmtu will be updated in tx path if in need. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	ip6_tunnel: add the process for redirect in ip6_tnl_err	Xin Long	1	-5/+10
	The same process for redirect in "ip6_gre: add the process for redirect in ip6gre_err" is needed by ip4ip6 and ip6ip6 as well. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	ip6_gre: process toobig in a better way	Xin Long	1	-13/+2
	Now ip6gre processes toobig icmp packet by setting gre dev's mtu in ip6gre_err, which would cause few things not good: - It couldn't set mtu with dev_set_mtu due to it's not in user context, which causes route cache and idev->cnf.mtu6 not to be updated. - It has to update sk dst pmtu in tx path according to gredev->mtu for ip6gre, while it updates pmtu again according to lower dst pmtu in ip6_tnl_xmit. - To change dev->mtu by toobig icmp packet is not a good idea, it should only work on pmtu. This patch is to process toobig by updating the lower dst's pmtu, as later sk dst pmtu will be updated in ip6_tnl_xmit, the same way as in ip4gre. Note that gre dev's mtu will not be updated any more, it doesn't make any sense to change dev's mtu after receiving a toobig packet. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	ip6_gre: add the process for redirect in ip6gre_err	Xin Long	1	-0/+5
	This patch is to add redirect icmp packet process for ip6gre by calling ip6_redirect() in ip6gre_err(), as in vti6_err. Prior to this patch, there's even no route cache generated after receiving redirect. Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	forcedeth: remove redudant assignments in xmit	Zhu Yanjun	1	-8/+20
	In xmit process, the variables are set many times. In fact, it is enough for these variables to be set once. After a long time test, the throughput performance is better than before. CC: Srinivas Eeda <srinivas.eeda@oracle.com> CC: Joe Jin <joe.jin@oracle.com> CC: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	Merge tag 'nfc-next-4.15-1' of ↵	David S. Miller	17	-42/+64
	git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next Samuel Ortiz says: ==================== NFC 4.15 pull request This is the NFC pull request for 4.15. We have: - A new netlink command for explicitly deactivating NFC targets - i2c constification for all NFC drivers - One NFC device allocation error path fix ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	Merge branch 'Openvswitch-meter-action'	David S. Miller	8	-31/+772
	Andy Zhou says: ==================== Openvswitch meter action This patch series is the first attempt to add openvswitch meter support. We have previously experimented with adding metering support in nftables. However 1) It was not clear how to expose a named nftables object cleanly, and 2) the logic that implements metering is quite small, < 100 lines of code. With those two observations, it seems cleaner to add meter support in the openvswitch module directly. --- v1(RFC)->v2: remove unused code improve locking and other review comments v2 -> v3: rebase v3 -> v4: fix undefined "__udivdi3" references on 32 bit builds. use div_u64() instead. v4 -> v5: rebase ==================== Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	openvswitch: Add meter action support	Andy Zhou	4	-0/+16
	Implements OVS kernel meter action support. Signed-off-by: Andy Zhou <azhou@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	openvswitch: Add meter infrastructure	Andy Zhou	5	-2/+674
	OVS kernel datapath so far does not support Openflow meter action. This is the first stab at adding kernel datapath meter support. This implementation supports only drop band type. Signed-off-by: Andy Zhou <azhou@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	openvswitch: export get_dp() API.	Andy Zhou	2	-29/+31
	Later patches will invoke get_dp() outside of datapath.c. Export it. Signed-off-by: Andy Zhou <azhou@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	openvswitch: Add meter netlink definitions	Andy Zhou	1	-0/+51
	Meter has its own netlink family. Define netlink messages and attributes for communicating with the user space programs. Signed-off-by: Andy Zhou <azhou@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	Merge branch 'dsa-b53-Support-prepended-Broadcom-tags'	David S. Miller	18	-44/+109
	Florian Fainelli says: ==================== net: dsa: b53: Support prepended Broadcom tags This patch series adds support for prepended 4-bytes Broadcom tags that we already support. This type of tag will typically be used when interfaced to a SoC like BCM58xx (NorthStar Plus) which supports a Flow Accelerator (WIP). In that case, we need to support a slightly different tagging format. The first patch does a bit of re-factoring and passes a port index to the get_tag_protocol() function since at least two different drivers need that type of information (mt7530, b53) to support tagging or not. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	net: dsa: b53: Support prepended Broadcom tags	Florian Fainelli	2	-4/+11
	On BCM58xx devices (Northstar Plus), there is an accelerator attached to port 8 which would only work if we use prepended Broadcom tags. Resolve that difference in our get_tag_protocol() function by setting the appropriate tagging protocol in that case. We need to change b53_brcm_hdr_setup() a little bit now since we can deal with two types of Broadcom tags. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	net: dsa: Support prepended Broadcom tag	Florian Fainelli	6	-7/+41
	Add a new type: DSA_TAG_PROTO_PREPEND which allows us to support for the 4-bytes Broadcom tag that we already support, but in a format where it is pre-pended to the packet instead of located between the MAC SA and the Ethertyper (DSA_TAG_PROTO_BRCM). Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	net: dsa: tag_brcm: Prepare for supporting prepended tag	Florian Fainelli	1	-11/+34
	In preparation for supporting the same Broadcom tag format, but instead of inserted between the MAC SA and EtherType, prepended to the Ethernet frame, restructure the code a little bit to make that possible and take an offset parameter. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	net: dsa: Pass a port to get_tag_protocol()	Florian Fainelli	12	-31/+32
	A number of drivers want to check whether the configured CPU port is a possible configuration for enabling tagging, pass down the CPU port number so they verify that. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	net/sched/sch_red.c: work around gcc-4.4.4 anon union initializer issue	Andrew Morton	1	-5/+9
	gcc-4.4.4 (at lest) has issues with initializers and anonymous unions: net/sched/sch_red.c: In function 'red_dump_offload': net/sched/sch_red.c:282: error: unknown field 'stats' specified in initializer net/sched/sch_red.c:282: warning: initialization makes integer from pointer without a cast net/sched/sch_red.c:283: error: unknown field 'stats' specified in initializer net/sched/sch_red.c:283: warning: initialization makes integer from pointer without a cast net/sched/sch_red.c: In function 'red_dump_stats': net/sched/sch_red.c:352: error: unknown field 'xstats' specified in initializer net/sched/sch_red.c:352: warning: initialization makes integer from pointer without a cast Work around this. Fixes: 602f3baf2218 ("net_sch: red: Add offload ability to RED qdisc") Cc: Nogah Frankel <nogahf@mellanox.com> Cc: Jiri Pirko <jiri@mellanox.com> Cc: Simon Horman <simon.horman@netronome.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	net/mlx4: Use Kconfig flag to remove support of old gen2 Mellanox devices	Slava Shwartsman	2	-0/+10
	Since Mellanox focus is on newer adapters, we would like to have the ability to disable the support for old gen2 adapters. This can be done by turning off the MLX4_CORE_GEN2 Kconfig flag. We keep it turned on by default. Signed-off-by: Slava Shwartsman <slavash@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	af_netlink: ensure that NLMSG_DONE never fails in dumps	Jason A. Donenfeld	2	-6/+12
	The way people generally use netlink_dump is that they fill in the skb as much as possible, breaking when nla_put returns an error. Then, they get called again and start filling out the next skb, and again, and so forth. The mechanism at work here is the ability for the iterative dumping function to detect when the skb is filled up and not fill it past the brim, waiting for a fresh skb for the rest of the data. However, if the attributes are small and nicely packed, it is possible that a dump callback function successfully fills in attributes until the skb is of size 4080 (libmnl's default page-sized receive buffer size). The dump function completes, satisfied, and then, if it happens to be that this is actually the last skb, and no further ones are to be sent, then netlink_dump will add on the NLMSG_DONE part: nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI); It is very important that netlink_dump does this, of course. However, in this example, that call to nlmsg_put_answer will fail, because the previous filling by the dump function did not leave it enough room. And how could it possibly have done so? All of the nla_put variety of functions simply check to see if the skb has enough tailroom, independent of the context it is in. In order to keep the important assumptions of all netlink dump users, it is therefore important to give them an skb that has this end part of the tail already reserved, so that the call to nlmsg_put_answer does not fail. Otherwise, library authors are forced to find some bizarre sized receive buffer that has a large modulo relative to the common sizes of messages received, which is ugly and buggy. This patch thus saves the NLMSG_DONE for an additional message, for the case that things are dangerously close to the brim. This requires keeping track of the errno from ->dump() across calls. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	Merge branch 'netem-add-nsec-scheduling-and-slot-feature'	David S. Miller	2	-29/+121
	Dave Taht says: ==================== netem: add nsec scheduling and slot feature This patch series converts netem away from the old "ticks" interface and userspace API, and adds support for a new "slot" feature intended to emulate bursty macs such as WiFi and LTE better. Changes since v2: Use u64 for packet_len_sched_time() Use simpler max(time_to_send,q->slot.slot_next) Changes since v1: Always pass new nanosecond APIs to userspace ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	netem: support delivering packets in delayed time slots	Dave Taht	2	-3/+79
	Slotting is a crude approximation of the behaviors of shared media such as cable, wifi, and LTE, which gather up a bunch of packets within a varying delay window and deliver them, relative to that, nearly all at once. It works within the existing loss, duplication, jitter and delay parameters of netem. Some amount of inherent latency must be specified, regardless. The new "slot" parameter specifies a minimum and maximum delay between transmission attempts. The "bytes" and "packets" parameters can be used to limit the amount of information transferred per slot. Examples of use: tc qdisc add dev eth0 root netem delay 200us \ slot 800us 10ms bytes 64k packets 42 A more correct example, using stacked netem instances and a packet limit to emulate a tail drop wifi queue with slots and variable packet delivery, with a 200Mbit isochronous underlying rate, and 20ms path delay: tc qdisc add dev eth0 root handle 1: netem delay 20ms rate 200mbit \ limit 10000 tc qdisc add dev eth0 parent 1:1 handle 10:1 netem delay 200us \ slot 800us 10ms bytes 64k packets 42 limit 512 Signed-off-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	netem: add uapi to express delay and jitter in nanoseconds	Dave Taht	2	-0/+16
	netem userspace has long relied on a horrible /proc/net/psched hack to translate the current notion of "ticks" to nanoseconds. Expressing latency and jitter instead, in well defined nanoseconds, increases the dynamic range of emulated delays and jitter in netem. It will also ease a transition where reducing a tick to nsec equivalence would constrain the max delay in prior versions of netem to only 4.3 seconds. Signed-off-by: Dave Taht <dave.taht@gmail.com> Suggested-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	netem: convert to qdisc_watchdog_schedule_ns	Dave Taht	1	-28/+28
	Upgrade the internal netem scheduler to use nanoseconds rather than ticks throughout. Convert to and from the std "ticks" userspace api automatically, while allowing for finer grained scheduling to take place. Signed-off-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	ipv6: try not to take rtnl_lock in ip6mr_sk_done	Francesco Ruggeri	1	-0/+4
	Avoid traversing the list of mr6_tables (which requires the rtnl_lock) in ip6mr_sk_done(), when we know in advance that a match will not be found. This can happen when rawv6_close()/ip6mr_sk_done() is invoked on non-mroute6 sockets. This patch helps reduce rtnl_lock contention when destroying a large number of net namespaces, each having a non-mroute6 raw socket. v2: same patch, only fixed subject line and expanded comment. Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	net: realtek: r8169: remove redundant assignment to giga_ctrl	Colin Ian King	1	-2/+0
	The variable giga_ctrl is being assigned to zero however this is never read and hence the assignment is redundant, so remove it. Cleans up clang warning: drivers/net/ethernet/realtek/r8169.c:1978:3: warning: Value stored to 'giga_ctrl' is never read Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	net: dsa: lan9303: Documentation: Add missing word "Mbps"	Egil Hjelmeland	1	-3/+3
	Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-13	net: dsa: lan9303: Fix lan9303_alr_del_port()	Egil Hjelmeland	1	-1/+1
	Fix embarrassing bug in lan9303_alr_del_port(): Instead of zeroing entr->mac_addr, I destroyed the next cache entry. Affected .port_fdb_del and .port_mdb_del. Fixes: 0620427ea0d6 ("net: dsa: lan9303: Add fdb/mdb manipulation") Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-12	timers: Add a function to start/reduce a timer	David Howells	2	-7/+39
	Add a function, similar to mod_timer(), that will start a timer if it isn't running and will modify it if it is running and has an expiry time longer than the new time. If the timer is running with an expiry time that's the same or sooner, no change is made. The function looks like: int timer_reduce(struct timer_list timer, unsigned long expires); This can be used by code such as networking code to make it easier to share a timer for multiple timeouts. For instance, in upcoming AF_RXRPC code, the rxrpc_call struct will maintain a number of timeouts: unsigned long ack_at; unsigned long resend_at; unsigned long ping_at; unsigned long expect_rx_by; unsigned long expect_req_by; unsigned long expect_term_by; each of which is set independently of the others. With timer reduction available, when the code needs to set one of the timeouts, it only needs to look at that timeout and then call timer_reduce() to modify the timer, starting it or bringing it forward if necessary. There is no need to refer to the other timeouts to see which is earliest and no need to take any lock other than, potentially, the timer lock inside timer_reduce(). Note, that this does not protect against concurrent invocations of any of the timer functions. As an example, the expect_rx_by timeout above, which terminates a call if we don't get a packet from the server within a certain time window, would be set something like this: unsigned long now = jiffies; unsigned long expect_rx_by = now + packet_receive_timeout; WRITE_ONCE(call->expect_rx_by, expect_rx_by); timer_reduce(&call->timer, expect_rx_by); The timer service code (which might, say, be in a work function) would then check all the timeouts to see which, if any, had triggered, deal with those: t = READ_ONCE(call->ack_at); if (time_after_eq(now, t)) { cmpxchg(&call->ack_at, t, now + MAX_JIFFY_OFFSET); set_bit(RXRPC_CALL_EV_ACK, &call->events); } and then restart the timer if necessary by finding the soonest timeout that hasn't yet passed and then calling timer_reduce(). The disadvantage of doing things this way rather than comparing the timers each time and calling mod_timer() is that you will* take timer events unless you can finish what you're doing and delete the timer in time. The advantage of doing things this way is that you don't need to use a lock to work out when the next timer should be set, other than the timer's own lock - which you might not have to take. [ tglx: Fixed weird formatting and adopted it to pending changes ] Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: keyrings@vger.kernel.org Cc: linux-afs@lists.infradead.org Link: https://lkml.kernel.org/r/151023090769.23050.1801643667223880753.stgit@warthog.procyon.org.uk
2017-11-12	pstore: Use ktime_get_real_fast_ns() instead of __getnstimeofday()	Arnd Bergmann	2	-4/+2
	__getnstimeofday() is a rather odd interface, with a number of quirks: - The caller may come from NMI context, but the implementation is not NMI safe, one way to get there from NMI is NMI handler: something bad panic() kmsg_dump() pstore_dump() pstore_record_init() __getnstimeofday() - The calling conventions are different from any other timekeeping functions, to deal with returning an error code during suspended timekeeping. Address the above issues by using a completely different method to get the time: ktime_get_real_fast_ns() is NMI safe and has a reasonable behavior when timekeeping is suspended: it returns the time at which it got suspended. As Thomas Gleixner explained, this is safe, as ktime_get_real_fast_ns() does not call into the clocksource driver that might be suspended. The result can easily be transformed into a timespec structure. Since ktime_get_real_fast_ns() was not exported to modules, add the export. The pstore behavior for the suspended case changes slightly, as it now stores the timestamp at which timekeeping was suspended instead of storing a zero timestamp. This change is not addressing y2038-safety, that's subject to a more complex follow up patch. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kees Cook <keescook@chromium.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Colin Cross <ccross@android.com> Link: https://lkml.kernel.org/r/20171110152530.1926955-1-arnd@arndb.de