From 3d43321b7015387cfebbe26436d0e9d299162ea1 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Thu, 2 Apr 2009 15:49:29 -0700 Subject: modules: sysctl to block module loading Implement a sysctl file that disables module-loading system-wide since there is no longer a viable way to remove CAP_SYS_MODULE after the system bounding capability set was removed in 2.6.25. Value can only be set to "1", and is tested only if standard capability checks allow CAP_SYS_MODULE. Given existing /dev/mem protections, this should allow administrators a one-way method to block module loading after initial boot-time module loading has finished. Signed-off-by: Kees Cook Acked-by: Serge Hallyn Signed-off-by: James Morris --- Documentation/sysctl/kernel.txt | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'Documentation') diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index a4ccdd1981cf..02b134956273 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -30,6 +30,7 @@ show up in /proc/sys/kernel: - kstack_depth_to_print [ X86 only ] - l2cr [ PPC only ] - modprobe ==> Documentation/debugging-modules.txt +- modules_disabled - msgmax - msgmnb - msgmni @@ -179,6 +180,16 @@ kernel stack. ============================================================== +modules_disabled: + +A toggle value indicating if modules are allowed to be loaded +in an otherwise modular kernel. This toggle defaults to off +(0), but can be set true (1). Once true, modules can be +neither loaded nor unloaded, and the toggle cannot be set back +to false. + +============================================================== + osrelease, ostype & version: # cat osrelease -- cgit v1.2.3 From 2fad2d9bb8310889f3261035b594b4e068b6eb8b Mon Sep 17 00:00:00 2001 From: Mark Langsdorf Date: Thu, 9 Apr 2009 15:31:53 +0200 Subject: x86/docs: add description for cache_disable sysfs interface Signed-off-by: Mark Langsdorf Signed-off-by: Andreas Herrmann Cc: Andrew Morton LKML-Reference: <20090409133153.GL31527@alberich.amd.com> Signed-off-by: Ingo Molnar --- Documentation/ABI/testing/sysfs-devices-cache_disable | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-devices-cache_disable (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-devices-cache_disable b/Documentation/ABI/testing/sysfs-devices-cache_disable new file mode 100644 index 000000000000..175bb4f70512 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-devices-cache_disable @@ -0,0 +1,18 @@ +What: /sys/devices/system/cpu/cpu*/cache/index*/cache_disable_X +Date: August 2008 +KernelVersion: 2.6.27 +Contact: mark.langsdorf@amd.com +Description: These files exist in every cpu's cache index directories. + There are currently 2 cache_disable_# files in each + directory. Reading from these files on a supported + processor will return that cache disable index value + for that processor and node. Writing to one of these + files will cause the specificed cache index to be disabled. + + Currently, only AMD Family 10h Processors support cache index + disable, and only for their L3 caches. See the BIOS and + Kernel Developer's Guide at + http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/31116-Public-GH-BKDG_3.20_2-4-09.pdf + for formatting information and other details on the + cache index disable. +Users: joachim.deguara@amd.com -- cgit v1.2.3 From 56c49951747f250d8398582509e02ae5ce1d36d1 Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Sat, 11 Apr 2009 15:51:19 -0400 Subject: tracing: Add documentation for the power tracer Signed-off-by: "Theodore Ts'o" Acked-by: Arjan van de Ven Cc: Frederic Weisbecker Cc: Steven Rostedt LKML-Reference: <1239479479-2603-4-git-send-email-tytso@mit.edu> Signed-off-by: Ingo Molnar --- Documentation/trace/power.txt | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) create mode 100644 Documentation/trace/power.txt (limited to 'Documentation') diff --git a/Documentation/trace/power.txt b/Documentation/trace/power.txt new file mode 100644 index 000000000000..cd805e16dc27 --- /dev/null +++ b/Documentation/trace/power.txt @@ -0,0 +1,17 @@ +The power tracer collects detailed information about C-state and P-state +transitions, instead of just looking at the high-level "average" +information. + +There is a helper script found in scrips/tracing/power.pl in the kernel +sources which can be used to parse this information and create a +Scalable Vector Graphics (SVG) picture from the trace data. + +To use this tracer: + + echo 0 > /sys/kernel/debug/tracing/tracing_enabled + echo power > /sys/kernel/debug/tracing/current_tracer + echo 1 > /sys/kernel/debug/tracing/tracing_enabled + sleep 1 + echo 0 > /sys/kernel/debug/tracing/tracing_enabled + cat /sys/kernel/debug/tracing/trace | \ + perl scripts/tracing/power.pl > out.sv -- cgit v1.2.3 From abd41443ac76d3e9c29a8c1d9e9a3312306cc55e Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Sat, 11 Apr 2009 15:51:18 -0400 Subject: tracing: Document the event tracing system Signed-off-by: "Theodore Ts'o" Cc: Theodore Ts'o Cc: Steven Rostedt LKML-Reference: <1239479479-2603-3-git-send-email-tytso@mit.edu> Signed-off-by: Ingo Molnar --- Documentation/trace/events.txt | 135 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 135 insertions(+) create mode 100644 Documentation/trace/events.txt (limited to 'Documentation') diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt new file mode 100644 index 000000000000..abdee664c0f6 --- /dev/null +++ b/Documentation/trace/events.txt @@ -0,0 +1,135 @@ + Event Tracing + + Documentation written by Theodore Ts'o + +Introduction +============ + +Tracepoints (see Documentation/trace/tracepoints.txt) can be used +without creating custom kernel modules to register probe functions +using the event tracing infrastructure. + +Not all tracepoints can be traced using the event tracing system; +the kernel developer must provide code snippets which define how the +tracing information is saved into the tracing buffer, and how the +the tracing information should be printed. + +Using Event Tracing +=================== + +The events which are available for tracing can be found in the file +/sys/kernel/debug/tracing/available_events. + +To enable a particular event, such as 'sched_wakeup', simply echo it +to /sys/debug/tracing/set_event. For example: + + # echo sched_wakeup > /sys/kernel/debug/tracing/set_event + +[ Note: events can also be enabled/disabled via the 'enabled' toggle + found in the /sys/kernel/tracing/events/ hierarchy of directories. ] + +To disable an event, echo the event name to the set_event file prefixed +with an exclamation point: + + # echo '!sched_wakeup' >> /sys/kernel/debug/tracing/set_event + +To disable events, echo an empty line to the set_event file: + + # echo > /sys/kernel/debug/tracing/set_event + +The events are organized into subsystems, such as ext4, irq, sched, +etc., and a full event name looks like this: :. The +subsystem name is optional, but it is displayed in the available_events +file. All of the events in a subsystem can be specified via the syntax +":*"; for example, to enable all irq events, you can use the +command: + + # echo 'irq:*' > /sys/kernel/debug/tracing/set_event + +Defining an event-enabled tracepoint +------------------------------------ + +A kernel developer which wishes to define an event-enabled tracepoint +must declare the tracepoint using TRACE_EVENT instead of DECLARE_TRACE. +This is done via two header files in include/trace. For example, to +event-enable the jbd2 subsystem, we must create two files, +include/trace/jbd2.h and include/trace/jbd2_event_types.h. The +include/trace/jbd2.h file should be included by kernel source files that +will have a tracepoint inserted, and might look like this: + +#ifndef _TRACE_JBD2_H +#define _TRACE_JBD2_H + +#include +#include + +#include + +#endif + +In a file that utilizes a jbd2 tracepoint, this header file would be +included. Note that you still have to use DEFINE_TRACE(). So for +example, if fs/jbd2/commit.c planned to use the jbd2_start_commit +tracepoint, it would have the following near the beginning of the file: + +#include + +DEFINE_TRACE(jbd2_start_commit); + +Then in the function that would call the tracepoint, it would call the +tracepoint function. (For more information, please see the tracepoint +documentation in Documentation/trace/tracepoints.txt): + + trace_jbd2_start_commit(journal, commit_transaction); + +The code snippets which allow jbd2_start_commit to be an event-enabled +tracepoint are placed in the file include/trace/jbd2_event_types.h: + +/* use instead */ +#ifndef TRACE_EVENT +# error Do not include this file directly. +# error Unless you know what you are doing. +#endif + +#undef TRACE_SYSTEM +#define TRACE_SYSTEM jbd2 + +#include + +TRACE_EVENT(jbd2_start_commit, + TP_PROTO(journal_t *journal, transaction_t *commit_transaction), + TP_ARGS(journal, commit_transaction), + TP_STRUCT__entry( + __array( char, devname, BDEVNAME_SIZE+24 ) + __field( int, transaction ) + ), + TP_fast_assign( + memcpy(__entry->devname, journal->j_devname, BDEVNAME_SIZE+24); + __entry->transaction = commit_transaction->t_tid; + ), + TP_printk("dev %s transaction %d", + __entry->devname, __entry->transaction) +); + +The TP_PROTO and TP_ARGS are unchanged from DECLARE_TRACE. The new +arguments to TRACE_EVENT are TP_STRUCT__entry, TP_fast_assign, and +TP_printk. + +TP_STRUCT__entry defines the data structure which will be stored in the +trace buffer. Normally, fields in __entry will be arrays or simple +types. It is possible to place data structures in __entry --- however, +pointers in the data structure can not be trusted, since they will be +accessed sometime later by TP_printk, and if the data structure contains +fields that will not or cannot be used by TP_printk, this will waste +space in the trace buffer. In general, data structures should be +avoided, unless they do only contain non-pointer types and all of the +fields will be used by TP_printk. + +TP_fast_assign defines the code snippet which saves information into the +__entry data structure, using the passed-in arguments defined in +TP_PROTO and TP_ARGS. + +Finally, TP_printk will print the __entry data structure. At the time +when the code snippet defined by TP_printk is executed, it will not have +access to the TP_ARGS arguments; it can only use the information saved +in the __entry data structure. -- cgit v1.2.3 From ecfcc53fef3c357574bb6143dce6631e6d56295c Mon Sep 17 00:00:00 2001 From: Etienne Basset Date: Wed, 8 Apr 2009 20:40:06 +0200 Subject: smack: implement logging V3 the following patch, add logging of Smack security decisions. This is of course very useful to understand what your current smack policy does. As suggested by Casey, it also now forbids labels with ', " or \ It introduces a '/smack/logging' switch : 0: no logging 1: log denied (default) 2: log accepted 3: log denied&accepted Signed-off-by: Etienne Basset Acked-by: Casey Schaufler Acked-by: Eric Paris Signed-off-by: James Morris --- Documentation/Smack.txt | 20 ++- security/Makefile | 3 + security/smack/smack.h | 108 +++++++++++- security/smack/smack_access.c | 143 ++++++++++++++-- security/smack/smack_lsm.c | 390 ++++++++++++++++++++++++++++++++---------- security/smack/smackfs.c | 66 +++++++ 6 files changed, 618 insertions(+), 112 deletions(-) (limited to 'Documentation') diff --git a/Documentation/Smack.txt b/Documentation/Smack.txt index 629c92e99783..34614b4c708e 100644 --- a/Documentation/Smack.txt +++ b/Documentation/Smack.txt @@ -184,8 +184,9 @@ length. Single character labels using special characters, that being anything other than a letter or digit, are reserved for use by the Smack development team. Smack labels are unstructured, case sensitive, and the only operation ever performed on them is comparison for equality. Smack labels cannot -contain unprintable characters or the "/" (slash) character. Smack labels -cannot begin with a '-', which is reserved for special options. +contain unprintable characters, the "/" (slash), the "\" (backslash), the "'" +(quote) and '"' (double-quote) characters. +Smack labels cannot begin with a '-', which is reserved for special options. There are some predefined labels: @@ -523,3 +524,18 @@ Smack supports some mount options: These mount options apply to all file system types. +Smack auditing + +If you want Smack auditing of security events, you need to set CONFIG_AUDIT +in your kernel configuration. +By default, all denied events will be audited. You can change this behavior by +writing a single character to the /smack/logging file : +0 : no logging +1 : log denied (default) +2 : log accepted +3 : log denied & accepted + +Events are logged as 'key=value' pairs, for each event you at least will get +the subjet, the object, the rights requested, the action, the kernel function +that triggered the event, plus other pairs depending on the type of event +audited. diff --git a/security/Makefile b/security/Makefile index fa77021d9778..c67557cdaa85 100644 --- a/security/Makefile +++ b/security/Makefile @@ -16,6 +16,9 @@ obj-$(CONFIG_SECURITYFS) += inode.o # Must precede capability.o in order to stack properly. obj-$(CONFIG_SECURITY_SELINUX) += selinux/built-in.o obj-$(CONFIG_SECURITY_SMACK) += smack/built-in.o +ifeq ($(CONFIG_AUDIT),y) +obj-$(CONFIG_SECURITY_SMACK) += lsm_audit.o +endif obj-$(CONFIG_SECURITY_TOMOYO) += tomoyo/built-in.o obj-$(CONFIG_SECURITY_ROOTPLUG) += root_plug.o obj-$(CONFIG_CGROUP_DEVICE) += device_cgroup.o diff --git a/security/smack/smack.h b/security/smack/smack.h index 42ef313f9856..243bec175be0 100644 --- a/security/smack/smack.h +++ b/security/smack/smack.h @@ -20,6 +20,7 @@ #include #include #include +#include /* * Why 23? CIPSO is constrained to 30, so a 32 byte buffer is @@ -178,6 +179,20 @@ struct smack_known { #define MAY_READWRITE (MAY_READ | MAY_WRITE) #define MAY_NOT 0 +/* + * Number of access types used by Smack (rwxa) + */ +#define SMK_NUM_ACCESS_TYPE 4 + +/* + * Smack audit data; is empty if CONFIG_AUDIT not set + * to save some stack + */ +struct smk_audit_info { +#ifdef CONFIG_AUDIT + struct common_audit_data a; +#endif +}; /* * These functions are in smack_lsm.c */ @@ -186,8 +201,8 @@ struct inode_smack *new_inode_smack(char *); /* * These functions are in smack_access.c */ -int smk_access(char *, char *, int); -int smk_curacc(char *, u32); +int smk_access(char *, char *, int, struct smk_audit_info *); +int smk_curacc(char *, u32, struct smk_audit_info *); int smack_to_cipso(const char *, struct smack_cipso *); void smack_from_cipso(u32, char *, char *); char *smack_from_secid(const u32); @@ -237,4 +252,93 @@ static inline char *smk_of_inode(const struct inode *isp) return sip->smk_inode; } +/* + * logging functions + */ +#define SMACK_AUDIT_DENIED 0x1 +#define SMACK_AUDIT_ACCEPT 0x2 +extern int log_policy; + +void smack_log(char *subject_label, char *object_label, + int request, + int result, struct smk_audit_info *auditdata); + +#ifdef CONFIG_AUDIT + +/* + * some inline functions to set up audit data + * they do nothing if CONFIG_AUDIT is not set + * + */ +static inline void smk_ad_init(struct smk_audit_info *a, const char *func, + char type) +{ + memset(a, 0, sizeof(*a)); + a->a.type = type; + a->a.function = func; +} + +static inline void smk_ad_setfield_u_tsk(struct smk_audit_info *a, + struct task_struct *t) +{ + a->a.u.tsk = t; +} +static inline void smk_ad_setfield_u_fs_path_dentry(struct smk_audit_info *a, + struct dentry *d) +{ + a->a.u.fs.path.dentry = d; +} +static inline void smk_ad_setfield_u_fs_path_mnt(struct smk_audit_info *a, + struct vfsmount *m) +{ + a->a.u.fs.path.mnt = m; +} +static inline void smk_ad_setfield_u_fs_inode(struct smk_audit_info *a, + struct inode *i) +{ + a->a.u.fs.inode = i; +} +static inline void smk_ad_setfield_u_fs_path(struct smk_audit_info *a, + struct path p) +{ + a->a.u.fs.path = p; +} +static inline void smk_ad_setfield_u_net_sk(struct smk_audit_info *a, + struct sock *sk) +{ + a->a.u.net.sk = sk; +} + +#else /* no AUDIT */ + +static inline void smk_ad_init(struct smk_audit_info *a, const char *func, + char type) +{ +} +static inline void smk_ad_setfield_u_tsk(struct smk_audit_info *a, + struct task_struct *t) +{ +} +static inline void smk_ad_setfield_u_fs_path_dentry(struct smk_audit_info *a, + struct dentry *d) +{ +} +static inline void smk_ad_setfield_u_fs_path_mnt(struct smk_audit_info *a, + struct vfsmount *m) +{ +} +static inline void smk_ad_setfield_u_fs_inode(struct smk_audit_info *a, + struct inode *i) +{ +} +static inline void smk_ad_setfield_u_fs_path(struct smk_audit_info *a, + struct path p) +{ +} +static inline void smk_ad_setfield_u_net_sk(struct smk_audit_info *a, + struct sock *sk) +{ +} +#endif + #endif /* _SECURITY_SMACK_H */ diff --git a/security/smack/smack_access.c b/security/smack/smack_access.c index ac0a2707f6d4..513dc1aa16dd 100644 --- a/security/smack/smack_access.c +++ b/security/smack/smack_access.c @@ -59,11 +59,18 @@ LIST_HEAD(smack_known_list); */ static u32 smack_next_secid = 10; +/* + * what events do we log + * can be overwritten at run-time by /smack/logging + */ +int log_policy = SMACK_AUDIT_DENIED; + /** * smk_access - determine if a subject has a specific access to an object * @subject_label: a pointer to the subject's Smack label * @object_label: a pointer to the object's Smack label * @request: the access requested, in "MAY" format + * @a : a pointer to the audit data * * This function looks up the subject/object pair in the * access rule list and returns 0 if the access is permitted, @@ -78,10 +85,12 @@ static u32 smack_next_secid = 10; * will be on the list, so checking the pointers may be a worthwhile * optimization. */ -int smk_access(char *subject_label, char *object_label, int request) +int smk_access(char *subject_label, char *object_label, int request, + struct smk_audit_info *a) { u32 may = MAY_NOT; struct smack_rule *srp; + int rc = 0; /* * Hardcoded comparisons. @@ -89,8 +98,10 @@ int smk_access(char *subject_label, char *object_label, int request) * A star subject can't access any object. */ if (subject_label == smack_known_star.smk_known || - strcmp(subject_label, smack_known_star.smk_known) == 0) - return -EACCES; + strcmp(subject_label, smack_known_star.smk_known) == 0) { + rc = -EACCES; + goto out_audit; + } /* * An internet object can be accessed by any subject. * Tasks cannot be assigned the internet label. @@ -100,20 +111,20 @@ int smk_access(char *subject_label, char *object_label, int request) subject_label == smack_known_web.smk_known || strcmp(object_label, smack_known_web.smk_known) == 0 || strcmp(subject_label, smack_known_web.smk_known) == 0) - return 0; + goto out_audit; /* * A star object can be accessed by any subject. */ if (object_label == smack_known_star.smk_known || strcmp(object_label, smack_known_star.smk_known) == 0) - return 0; + goto out_audit; /* * An object can be accessed in any way by a subject * with the same label. */ if (subject_label == object_label || strcmp(subject_label, object_label) == 0) - return 0; + goto out_audit; /* * A hat subject can read any object. * A floor object can be read by any subject. @@ -121,10 +132,10 @@ int smk_access(char *subject_label, char *object_label, int request) if ((request & MAY_ANYREAD) == request) { if (object_label == smack_known_floor.smk_known || strcmp(object_label, smack_known_floor.smk_known) == 0) - return 0; + goto out_audit; if (subject_label == smack_known_hat.smk_known || strcmp(subject_label, smack_known_hat.smk_known) == 0) - return 0; + goto out_audit; } /* * Beyond here an explicit relationship is required. @@ -148,28 +159,36 @@ int smk_access(char *subject_label, char *object_label, int request) * This is a bit map operation. */ if ((request & may) == request) - return 0; - - return -EACCES; + goto out_audit; + + rc = -EACCES; +out_audit: +#ifdef CONFIG_AUDIT + if (a) + smack_log(subject_label, object_label, request, rc, a); +#endif + return rc; } /** * smk_curacc - determine if current has a specific access to an object * @obj_label: a pointer to the object's Smack label * @mode: the access requested, in "MAY" format + * @a : common audit data * * This function checks the current subject label/object label pair * in the access rule list and returns 0 if the access is permitted, * non zero otherwise. It allows that current may have the capability * to override the rules. */ -int smk_curacc(char *obj_label, u32 mode) +int smk_curacc(char *obj_label, u32 mode, struct smk_audit_info *a) { int rc; + char *sp = current_security(); - rc = smk_access(current_security(), obj_label, mode); + rc = smk_access(sp, obj_label, mode, NULL); if (rc == 0) - return 0; + goto out_audit; /* * Return if a specific label has been designated as the @@ -177,14 +196,105 @@ int smk_curacc(char *obj_label, u32 mode) * have that label. */ if (smack_onlycap != NULL && smack_onlycap != current->cred->security) - return rc; + goto out_audit; if (capable(CAP_MAC_OVERRIDE)) return 0; +out_audit: +#ifdef CONFIG_AUDIT + if (a) + smack_log(sp, obj_label, mode, rc, a); +#endif return rc; } +#ifdef CONFIG_AUDIT +/** + * smack_str_from_perm : helper to transalate an int to a + * readable string + * @string : the string to fill + * @access : the int + * + */ +static inline void smack_str_from_perm(char *string, int access) +{ + int i = 0; + if (access & MAY_READ) + string[i++] = 'r'; + if (access & MAY_WRITE) + string[i++] = 'w'; + if (access & MAY_EXEC) + string[i++] = 'x'; + if (access & MAY_APPEND) + string[i++] = 'a'; + string[i] = '\0'; +} +/** + * smack_log_callback - SMACK specific information + * will be called by generic audit code + * @ab : the audit_buffer + * @a : audit_data + * + */ +static void smack_log_callback(struct audit_buffer *ab, void *a) +{ + struct common_audit_data *ad = a; + struct smack_audit_data *sad = &ad->lsm_priv.smack_audit_data; + audit_log_format(ab, "lsm=SMACK fn=%s action=%s", ad->function, + sad->result ? "denied" : "granted"); + audit_log_format(ab, " subject="); + audit_log_untrustedstring(ab, sad->subject); + audit_log_format(ab, " object="); + audit_log_untrustedstring(ab, sad->object); + audit_log_format(ab, " requested=%s", sad->request); +} + +/** + * smack_log - Audit the granting or denial of permissions. + * @subject_label : smack label of the requester + * @object_label : smack label of the object being accessed + * @request: requested permissions + * @result: result from smk_access + * @a: auxiliary audit data + * + * Audit the granting or denial of permissions in accordance + * with the policy. + */ +void smack_log(char *subject_label, char *object_label, int request, + int result, struct smk_audit_info *ad) +{ + char request_buffer[SMK_NUM_ACCESS_TYPE + 1]; + struct smack_audit_data *sad; + struct common_audit_data *a = &ad->a; + + /* check if we have to log the current event */ + if (result != 0 && (log_policy & SMACK_AUDIT_DENIED) == 0) + return; + if (result == 0 && (log_policy & SMACK_AUDIT_ACCEPT) == 0) + return; + + if (a->function == NULL) + a->function = "unknown"; + + /* end preparing the audit data */ + sad = &a->lsm_priv.smack_audit_data; + smack_str_from_perm(request_buffer, request); + sad->subject = subject_label; + sad->object = object_label; + sad->request = request_buffer; + sad->result = result; + a->lsm_pre_audit = smack_log_callback; + + common_lsm_audit(a); +} +#else /* #ifdef CONFIG_AUDIT */ +void smack_log(char *subject_label, char *object_label, int request, + int result, struct smk_audit_info *ad) +{ +} +#endif + static DEFINE_MUTEX(smack_known_lock); /** @@ -209,7 +319,8 @@ struct smack_known *smk_import_entry(const char *string, int len) if (found) smack[i] = '\0'; else if (i >= len || string[i] > '~' || string[i] <= ' ' || - string[i] == '/') { + string[i] == '/' || string[i] == '"' || + string[i] == '\\' || string[i] == '\'') { smack[i] = '\0'; found = 1; } else diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c index 921514902eca..f557767911c9 100644 --- a/security/smack/smack_lsm.c +++ b/security/smack/smack_lsm.c @@ -30,7 +30,6 @@ #include #include #include - #include "smack.h" #define task_security(task) (task_cred_xxx((task), security)) @@ -103,14 +102,24 @@ struct inode_smack *new_inode_smack(char *smack) static int smack_ptrace_may_access(struct task_struct *ctp, unsigned int mode) { int rc; + struct smk_audit_info ad; + char *sp, *tsp; rc = cap_ptrace_may_access(ctp, mode); if (rc != 0) return rc; - rc = smk_access(current_security(), task_security(ctp), MAY_READWRITE); + sp = current_security(); + tsp = task_security(ctp); + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_TASK); + smk_ad_setfield_u_tsk(&ad, ctp); + + /* we won't log here, because rc can be overriden */ + rc = smk_access(sp, tsp, MAY_READWRITE, NULL); if (rc != 0 && capable(CAP_MAC_OVERRIDE)) - return 0; + rc = 0; + + smack_log(sp, tsp, MAY_READWRITE, rc, &ad); return rc; } @@ -125,14 +134,24 @@ static int smack_ptrace_may_access(struct task_struct *ctp, unsigned int mode) static int smack_ptrace_traceme(struct task_struct *ptp) { int rc; + struct smk_audit_info ad; + char *sp, *tsp; rc = cap_ptrace_traceme(ptp); if (rc != 0) return rc; - rc = smk_access(task_security(ptp), current_security(), MAY_READWRITE); + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_TASK); + smk_ad_setfield_u_tsk(&ad, ptp); + + sp = current_security(); + tsp = task_security(ptp); + /* we won't log here, because rc can be overriden */ + rc = smk_access(tsp, sp, MAY_READWRITE, NULL); if (rc != 0 && has_capability(ptp, CAP_MAC_OVERRIDE)) - return 0; + rc = 0; + + smack_log(tsp, sp, MAY_READWRITE, rc, &ad); return rc; } @@ -327,8 +346,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data) static int smack_sb_statfs(struct dentry *dentry) { struct superblock_smack *sbp = dentry->d_sb->s_security; + int rc; + struct smk_audit_info ad; - return smk_curacc(sbp->smk_floor, MAY_READ); + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_FS); + smk_ad_setfield_u_fs_path_dentry(&ad, dentry); + + rc = smk_curacc(sbp->smk_floor, MAY_READ, &ad); + return rc; } /** @@ -346,8 +371,12 @@ static int smack_sb_mount(char *dev_name, struct path *path, char *type, unsigned long flags, void *data) { struct superblock_smack *sbp = path->mnt->mnt_sb->s_security; + struct smk_audit_info ad; + + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_FS); + smk_ad_setfield_u_fs_path(&ad, *path); - return smk_curacc(sbp->smk_floor, MAY_WRITE); + return smk_curacc(sbp->smk_floor, MAY_WRITE, &ad); } /** @@ -361,10 +390,14 @@ static int smack_sb_mount(char *dev_name, struct path *path, static int smack_sb_umount(struct vfsmount *mnt, int flags) { struct superblock_smack *sbp; + struct smk_audit_info ad; - sbp = mnt->mnt_sb->s_security; + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_FS); + smk_ad_setfield_u_fs_path_dentry(&ad, mnt->mnt_mountpoint); + smk_ad_setfield_u_fs_path_mnt(&ad, mnt); - return smk_curacc(sbp->smk_floor, MAY_WRITE); + sbp = mnt->mnt_sb->s_security; + return smk_curacc(sbp->smk_floor, MAY_WRITE, &ad); } /* @@ -441,15 +474,20 @@ static int smack_inode_init_security(struct inode *inode, struct inode *dir, static int smack_inode_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry) { - int rc; char *isp; + struct smk_audit_info ad; + int rc; + + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_FS); + smk_ad_setfield_u_fs_path_dentry(&ad, old_dentry); isp = smk_of_inode(old_dentry->d_inode); - rc = smk_curacc(isp, MAY_WRITE); + rc = smk_curacc(isp, MAY_WRITE, &ad); if (rc == 0 && new_dentry->d_inode != NULL) { isp = smk_of_inode(new_dentry->d_inode); - rc = smk_curacc(isp, MAY_WRITE); + smk_ad_setfield_u_fs_path_dentry(&ad, new_dentry); + rc = smk_curacc(isp, MAY_WRITE, &ad); } return rc; @@ -466,18 +504,24 @@ static int smack_inode_link(struct dentry *old_dentry, struct inode *dir, static int smack_inode_unlink(struct inode *dir, struct dentry *dentry) { struct inode *ip = dentry->d_inode; + struct smk_audit_info ad; int rc; + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_FS); + smk_ad_setfield_u_fs_path_dentry(&ad, dentry); + /* * You need write access to the thing you're unlinking */ - rc = smk_curacc(smk_of_inode(ip), MAY_WRITE); - if (rc == 0) + rc = smk_curacc(smk_of_inode(ip), MAY_WRITE, &ad); + if (rc == 0) { /* * You also need write access to the containing directory */ - rc = smk_curacc(smk_of_inode(dir), MAY_WRITE); - + smk_ad_setfield_u_fs_path_dentry(&ad, NULL); + smk_ad_setfield_u_fs_inode(&ad, dir); + rc = smk_curacc(smk_of_inode(dir), MAY_WRITE, &ad); + } return rc; } @@ -491,17 +535,24 @@ static int smack_inode_unlink(struct inode *dir, struct dentry *dentry) */ static int smack_inode_rmdir(struct inode *dir, struct dentry *dentry) { + struct smk_audit_info ad; int rc; + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_FS); + smk_ad_setfield_u_fs_path_dentry(&ad, dentry); + /* * You need write access to the thing you're removing */ - rc = smk_curacc(smk_of_inode(dentry->d_inode), MAY_WRITE); - if (rc == 0) + rc = smk_curacc(smk_of_inode(dentry->d_inode), MAY_WRITE, &ad); + if (rc == 0) { /* * You also need write access to the containing directory */ - rc = smk_curacc(smk_of_inode(dir), MAY_WRITE); + smk_ad_setfield_u_fs_path_dentry(&ad, NULL); + smk_ad_setfield_u_fs_inode(&ad, dir); + rc = smk_curacc(smk_of_inode(dir), MAY_WRITE, &ad); + } return rc; } @@ -525,15 +576,19 @@ static int smack_inode_rename(struct inode *old_inode, { int rc; char *isp; + struct smk_audit_info ad; + + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_FS); + smk_ad_setfield_u_fs_path_dentry(&ad, old_dentry); isp = smk_of_inode(old_dentry->d_inode); - rc = smk_curacc(isp, MAY_READWRITE); + rc = smk_curacc(isp, MAY_READWRITE, &ad); if (rc == 0 && new_dentry->d_inode != NULL) { isp = smk_of_inode(new_dentry->d_inode); - rc = smk_curacc(isp, MAY_READWRITE); + smk_ad_setfield_u_fs_path_dentry(&ad, new_dentry); + rc = smk_curacc(isp, MAY_READWRITE, &ad); } - return rc; } @@ -548,13 +603,15 @@ static int smack_inode_rename(struct inode *old_inode, */ static int smack_inode_permission(struct inode *inode, int mask) { + struct smk_audit_info ad; /* * No permission to check. Existence test. Yup, it's there. */ if (mask == 0) return 0; - - return smk_curacc(smk_of_inode(inode), mask); + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_FS); + smk_ad_setfield_u_fs_inode(&ad, inode); + return smk_curacc(smk_of_inode(inode), mask, &ad); } /** @@ -566,13 +623,16 @@ static int smack_inode_permission(struct inode *inode, int mask) */ static int smack_inode_setattr(struct dentry *dentry, struct iattr *iattr) { + struct smk_audit_info ad; /* * Need to allow for clearing the setuid bit. */ if (iattr->ia_valid & ATTR_FORCE) return 0; + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_FS); + smk_ad_setfield_u_fs_path_dentry(&ad, dentry); - return smk_curacc(smk_of_inode(dentry->d_inode), MAY_WRITE); + return smk_curacc(smk_of_inode(dentry->d_inode), MAY_WRITE, &ad); } /** @@ -584,7 +644,12 @@ static int smack_inode_setattr(struct dentry *dentry, struct iattr *iattr) */ static int smack_inode_getattr(struct vfsmount *mnt, struct dentry *dentry) { - return smk_curacc(smk_of_inode(dentry->d_inode), MAY_READ); + struct smk_audit_info ad; + + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_FS); + smk_ad_setfield_u_fs_path_dentry(&ad, dentry); + smk_ad_setfield_u_fs_path_mnt(&ad, mnt); + return smk_curacc(smk_of_inode(dentry->d_inode), MAY_READ, &ad); } /** @@ -602,6 +667,7 @@ static int smack_inode_getattr(struct vfsmount *mnt, struct dentry *dentry) static int smack_inode_setxattr(struct dentry *dentry, const char *name, const void *value, size_t size, int flags) { + struct smk_audit_info ad; int rc = 0; if (strcmp(name, XATTR_NAME_SMACK) == 0 || @@ -615,8 +681,11 @@ static int smack_inode_setxattr(struct dentry *dentry, const char *name, } else rc = cap_inode_setxattr(dentry, name, value, size, flags); + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_FS); + smk_ad_setfield_u_fs_path_dentry(&ad, dentry); + if (rc == 0) - rc = smk_curacc(smk_of_inode(dentry->d_inode), MAY_WRITE); + rc = smk_curacc(smk_of_inode(dentry->d_inode), MAY_WRITE, &ad); return rc; } @@ -671,7 +740,12 @@ static void smack_inode_post_setxattr(struct dentry *dentry, const char *name, */ static int smack_inode_getxattr(struct dentry *dentry, const char *name) { - return smk_curacc(smk_of_inode(dentry->d_inode), MAY_READ); + struct smk_audit_info ad; + + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_FS); + smk_ad_setfield_u_fs_path_dentry(&ad, dentry); + + return smk_curacc(smk_of_inode(dentry->d_inode), MAY_READ, &ad); } /* @@ -685,6 +759,7 @@ static int smack_inode_getxattr(struct dentry *dentry, const char *name) */ static int smack_inode_removexattr(struct dentry *dentry, const char *name) { + struct smk_audit_info ad; int rc = 0; if (strcmp(name, XATTR_NAME_SMACK) == 0 || @@ -695,8 +770,10 @@ static int smack_inode_removexattr(struct dentry *dentry, const char *name) } else rc = cap_inode_removexattr(dentry, name); + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_FS); + smk_ad_setfield_u_fs_path_dentry(&ad, dentry); if (rc == 0) - rc = smk_curacc(smk_of_inode(dentry->d_inode), MAY_WRITE); + rc = smk_curacc(smk_of_inode(dentry->d_inode), MAY_WRITE, &ad); return rc; } @@ -855,12 +932,16 @@ static int smack_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { int rc = 0; + struct smk_audit_info ad; + + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_FS); + smk_ad_setfield_u_fs_path(&ad, file->f_path); if (_IOC_DIR(cmd) & _IOC_WRITE) - rc = smk_curacc(file->f_security, MAY_WRITE); + rc = smk_curacc(file->f_security, MAY_WRITE, &ad); if (rc == 0 && (_IOC_DIR(cmd) & _IOC_READ)) - rc = smk_curacc(file->f_security, MAY_READ); + rc = smk_curacc(file->f_security, MAY_READ, &ad); return rc; } @@ -874,7 +955,11 @@ static int smack_file_ioctl(struct file *file, unsigned int cmd, */ static int smack_file_lock(struct file *file, unsigned int cmd) { - return smk_curacc(file->f_security, MAY_WRITE); + struct smk_audit_info ad; + + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_FS); + smk_ad_setfield_u_fs_path_dentry(&ad, file->f_path.dentry); + return smk_curacc(file->f_security, MAY_WRITE, &ad); } /** @@ -888,8 +973,12 @@ static int smack_file_lock(struct file *file, unsigned int cmd) static int smack_file_fcntl(struct file *file, unsigned int cmd, unsigned long arg) { + struct smk_audit_info ad; int rc; + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_FS); + smk_ad_setfield_u_fs_path(&ad, file->f_path); + switch (cmd) { case F_DUPFD: case F_GETFD: @@ -897,7 +986,7 @@ static int smack_file_fcntl(struct file *file, unsigned int cmd, case F_GETLK: case F_GETOWN: case F_GETSIG: - rc = smk_curacc(file->f_security, MAY_READ); + rc = smk_curacc(file->f_security, MAY_READ, &ad); break; case F_SETFD: case F_SETFL: @@ -905,10 +994,10 @@ static int smack_file_fcntl(struct file *file, unsigned int cmd, case F_SETLKW: case F_SETOWN: case F_SETSIG: - rc = smk_curacc(file->f_security, MAY_WRITE); + rc = smk_curacc(file->f_security, MAY_WRITE, &ad); break; default: - rc = smk_curacc(file->f_security, MAY_READWRITE); + rc = smk_curacc(file->f_security, MAY_READWRITE, &ad); } return rc; @@ -943,14 +1032,21 @@ static int smack_file_send_sigiotask(struct task_struct *tsk, { struct file *file; int rc; + char *tsp = tsk->cred->security; + struct smk_audit_info ad; /* * struct fown_struct is never outside the context of a struct file */ file = container_of(fown, struct file, f_owner); - rc = smk_access(file->f_security, tsk->cred->security, MAY_WRITE); + /* we don't log here as rc can be overriden */ + rc = smk_access(file->f_security, tsp, MAY_WRITE, NULL); if (rc != 0 && has_capability(tsk, CAP_MAC_OVERRIDE)) - return 0; + rc = 0; + + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_TASK); + smk_ad_setfield_u_tsk(&ad, tsk); + smack_log(file->f_security, tsp, MAY_WRITE, rc, &ad); return rc; } @@ -963,7 +1059,10 @@ static int smack_file_send_sigiotask(struct task_struct *tsk, static int smack_file_receive(struct file *file) { int may = 0; + struct smk_audit_info ad; + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_TASK); + smk_ad_setfield_u_fs_path(&ad, file->f_path); /* * This code relies on bitmasks. */ @@ -972,7 +1071,7 @@ static int smack_file_receive(struct file *file) if (file->f_mode & FMODE_WRITE) may |= MAY_WRITE; - return smk_curacc(file->f_security, may); + return smk_curacc(file->f_security, may, &ad); } /* @@ -1051,6 +1150,22 @@ static int smack_kernel_create_files_as(struct cred *new, return 0; } +/** + * smk_curacc_on_task - helper to log task related access + * @p: the task object + * @access : the access requested + * + * Return 0 if access is permitted + */ +static int smk_curacc_on_task(struct task_struct *p, int access) +{ + struct smk_audit_info ad; + + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_TASK); + smk_ad_setfield_u_tsk(&ad, p); + return smk_curacc(task_security(p), access, &ad); +} + /** * smack_task_setpgid - Smack check on setting pgid * @p: the task object @@ -1060,7 +1175,7 @@ static int smack_kernel_create_files_as(struct cred *new, */ static int smack_task_setpgid(struct task_struct *p, pid_t pgid) { - return smk_curacc(task_security(p), MAY_WRITE); + return smk_curacc_on_task(p, MAY_WRITE); } /** @@ -1071,7 +1186,7 @@ static int smack_task_setpgid(struct task_struct *p, pid_t pgid) */ static int smack_task_getpgid(struct task_struct *p) { - return smk_curacc(task_security(p), MAY_READ); + return smk_curacc_on_task(p, MAY_READ); } /** @@ -1082,7 +1197,7 @@ static int smack_task_getpgid(struct task_struct *p) */ static int smack_task_getsid(struct task_struct *p) { - return smk_curacc(task_security(p), MAY_READ); + return smk_curacc_on_task(p, MAY_READ); } /** @@ -1110,7 +1225,7 @@ static int smack_task_setnice(struct task_struct *p, int nice) rc = cap_task_setnice(p, nice); if (rc == 0) - rc = smk_curacc(task_security(p), MAY_WRITE); + rc = smk_curacc_on_task(p, MAY_WRITE); return rc; } @@ -1127,7 +1242,7 @@ static int smack_task_setioprio(struct task_struct *p, int ioprio) rc = cap_task_setioprio(p, ioprio); if (rc == 0) - rc = smk_curacc(task_security(p), MAY_WRITE); + rc = smk_curacc_on_task(p, MAY_WRITE); return rc; } @@ -1139,7 +1254,7 @@ static int smack_task_setioprio(struct task_struct *p, int ioprio) */ static int smack_task_getioprio(struct task_struct *p) { - return smk_curacc(task_security(p), MAY_READ); + return smk_curacc_on_task(p, MAY_READ); } /** @@ -1157,7 +1272,7 @@ static int smack_task_setscheduler(struct task_struct *p, int policy, rc = cap_task_setscheduler(p, policy, lp); if (rc == 0) - rc = smk_curacc(task_security(p), MAY_WRITE); + rc = smk_curacc_on_task(p, MAY_WRITE); return rc; } @@ -1169,7 +1284,7 @@ static int smack_task_setscheduler(struct task_struct *p, int policy, */ static int smack_task_getscheduler(struct task_struct *p) { - return smk_curacc(task_security(p), MAY_READ); + return smk_curacc_on_task(p, MAY_READ); } /** @@ -1180,7 +1295,7 @@ static int smack_task_getscheduler(struct task_struct *p) */ static int smack_task_movememory(struct task_struct *p) { - return smk_curacc(task_security(p), MAY_WRITE); + return smk_curacc_on_task(p, MAY_WRITE); } /** @@ -1198,18 +1313,23 @@ static int smack_task_movememory(struct task_struct *p) static int smack_task_kill(struct task_struct *p, struct siginfo *info, int sig, u32 secid) { + struct smk_audit_info ad; + + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_TASK); + smk_ad_setfield_u_tsk(&ad, p); /* * Sending a signal requires that the sender * can write the receiver. */ if (secid == 0) - return smk_curacc(task_security(p), MAY_WRITE); + return smk_curacc(task_security(p), MAY_WRITE, &ad); /* * If the secid isn't 0 we're dealing with some USB IO * specific behavior. This is not clean. For one thing * we can't take privilege into account. */ - return smk_access(smack_from_secid(secid), task_security(p), MAY_WRITE); + return smk_access(smack_from_secid(secid), task_security(p), + MAY_WRITE, &ad); } /** @@ -1220,11 +1340,15 @@ static int smack_task_kill(struct task_struct *p, struct siginfo *info, */ static int smack_task_wait(struct task_struct *p) { + struct smk_audit_info ad; + char *sp = current_security(); + char *tsp = task_security(p); int rc; - rc = smk_access(current_security(), task_security(p), MAY_WRITE); + /* we don't log here, we can be overriden */ + rc = smk_access(sp, tsp, MAY_WRITE, NULL); if (rc == 0) - return 0; + goto out_log; /* * Allow the operation to succeed if either task @@ -1238,8 +1362,12 @@ static int smack_task_wait(struct task_struct *p) * the smack value. */ if (capable(CAP_MAC_OVERRIDE) || has_capability(p, CAP_MAC_OVERRIDE)) - return 0; - + rc = 0; + /* we log only if we didn't get overriden */ + out_log: + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_TASK); + smk_ad_setfield_u_tsk(&ad, p); + smack_log(sp, tsp, MAY_WRITE, rc, &ad); return rc; } @@ -1455,12 +1583,19 @@ static int smack_netlabel_send(struct sock *sk, struct sockaddr_in *sap) int sk_lbl; char *hostsp; struct socket_smack *ssp = sk->sk_security; + struct smk_audit_info ad; rcu_read_lock(); hostsp = smack_host_label(sap); if (hostsp != NULL) { sk_lbl = SMACK_UNLABELED_SOCKET; - rc = smk_access(ssp->smk_out, hostsp, MAY_WRITE); +#ifdef CONFIG_AUDIT + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_NET); + ad.a.u.net.family = sap->sin_family; + ad.a.u.net.dport = sap->sin_port; + ad.a.u.net.v4info.daddr = sap->sin_addr.s_addr; +#endif + rc = smk_access(ssp->smk_out, hostsp, MAY_WRITE, &ad); } else { sk_lbl = SMACK_CIPSO_SOCKET; rc = 0; @@ -1655,6 +1790,25 @@ static void smack_shm_free_security(struct shmid_kernel *shp) isp->security = NULL; } +/** + * smk_curacc_shm : check if current has access on shm + * @shp : the object + * @access : access requested + * + * Returns 0 if current has the requested access, error code otherwise + */ +static int smk_curacc_shm(struct shmid_kernel *shp, int access) +{ + char *ssp = smack_of_shm(shp); + struct smk_audit_info ad; + +#ifdef CONFIG_AUDIT + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_IPC); + ad.a.u.ipc_id = shp->shm_perm.id; +#endif + return smk_curacc(ssp, access, &ad); +} + /** * smack_shm_associate - Smack access check for shm * @shp: the object @@ -1664,11 +1818,10 @@ static void smack_shm_free_security(struct shmid_kernel *shp) */ static int smack_shm_associate(struct shmid_kernel *shp, int shmflg) { - char *ssp = smack_of_shm(shp); int may; may = smack_flags_to_may(shmflg); - return smk_curacc(ssp, may); + return smk_curacc_shm(shp, may); } /** @@ -1680,7 +1833,6 @@ static int smack_shm_associate(struct shmid_kernel *shp, int shmflg) */ static int smack_shm_shmctl(struct shmid_kernel *shp, int cmd) { - char *ssp; int may; switch (cmd) { @@ -1703,9 +1855,7 @@ static int smack_shm_shmctl(struct shmid_kernel *shp, int cmd) default: return -EINVAL; } - - ssp = smack_of_shm(shp); - return smk_curacc(ssp, may); + return smk_curacc_shm(shp, may); } /** @@ -1719,11 +1869,10 @@ static int smack_shm_shmctl(struct shmid_kernel *shp, int cmd) static int smack_shm_shmat(struct shmid_kernel *shp, char __user *shmaddr, int shmflg) { - char *ssp = smack_of_shm(shp); int may; may = smack_flags_to_may(shmflg); - return smk_curacc(ssp, may); + return smk_curacc_shm(shp, may); } /** @@ -1764,6 +1913,25 @@ static void smack_sem_free_security(struct sem_array *sma) isp->security = NULL; } +/** + * smk_curacc_sem : check if current has access on sem + * @sma : the object + * @access : access requested + * + * Returns 0 if current has the requested access, error code otherwise + */ +static int smk_curacc_sem(struct sem_array *sma, int access) +{ + char *ssp = smack_of_sem(sma); + struct smk_audit_info ad; + +#ifdef CONFIG_AUDIT + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_IPC); + ad.a.u.ipc_id = sma->sem_perm.id; +#endif + return smk_curacc(ssp, access, &ad); +} + /** * smack_sem_associate - Smack access check for sem * @sma: the object @@ -1773,11 +1941,10 @@ static void smack_sem_free_security(struct sem_array *sma) */ static int smack_sem_associate(struct sem_array *sma, int semflg) { - char *ssp = smack_of_sem(sma); int may; may = smack_flags_to_may(semflg); - return smk_curacc(ssp, may); + return smk_curacc_sem(sma, may); } /** @@ -1789,7 +1956,6 @@ static int smack_sem_associate(struct sem_array *sma, int semflg) */ static int smack_sem_semctl(struct sem_array *sma, int cmd) { - char *ssp; int may; switch (cmd) { @@ -1818,8 +1984,7 @@ static int smack_sem_semctl(struct sem_array *sma, int cmd) return -EINVAL; } - ssp = smack_of_sem(sma); - return smk_curacc(ssp, may); + return smk_curacc_sem(sma, may); } /** @@ -1836,9 +2001,7 @@ static int smack_sem_semctl(struct sem_array *sma, int cmd) static int smack_sem_semop(struct sem_array *sma, struct sembuf *sops, unsigned nsops, int alter) { - char *ssp = smack_of_sem(sma); - - return smk_curacc(ssp, MAY_READWRITE); + return smk_curacc_sem(sma, MAY_READWRITE); } /** @@ -1879,6 +2042,25 @@ static char *smack_of_msq(struct msg_queue *msq) return (char *)msq->q_perm.security; } +/** + * smk_curacc_msq : helper to check if current has access on msq + * @msq : the msq + * @access : access requested + * + * return 0 if current has access, error otherwise + */ +static int smk_curacc_msq(struct msg_queue *msq, int access) +{ + char *msp = smack_of_msq(msq); + struct smk_audit_info ad; + +#ifdef CONFIG_AUDIT + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_IPC); + ad.a.u.ipc_id = msq->q_perm.id; +#endif + return smk_curacc(msp, access, &ad); +} + /** * smack_msg_queue_associate - Smack access check for msg_queue * @msq: the object @@ -1888,11 +2070,10 @@ static char *smack_of_msq(struct msg_queue *msq) */ static int smack_msg_queue_associate(struct msg_queue *msq, int msqflg) { - char *msp = smack_of_msq(msq); int may; may = smack_flags_to_may(msqflg); - return smk_curacc(msp, may); + return smk_curacc_msq(msq, may); } /** @@ -1904,7 +2085,6 @@ static int smack_msg_queue_associate(struct msg_queue *msq, int msqflg) */ static int smack_msg_queue_msgctl(struct msg_queue *msq, int cmd) { - char *msp; int may; switch (cmd) { @@ -1926,8 +2106,7 @@ static int smack_msg_queue_msgctl(struct msg_queue *msq, int cmd) return -EINVAL; } - msp = smack_of_msq(msq); - return smk_curacc(msp, may); + return smk_curacc_msq(msq, may); } /** @@ -1941,11 +2120,10 @@ static int smack_msg_queue_msgctl(struct msg_queue *msq, int cmd) static int smack_msg_queue_msgsnd(struct msg_queue *msq, struct msg_msg *msg, int msqflg) { - char *msp = smack_of_msq(msq); - int rc; + int may; - rc = smack_flags_to_may(msqflg); - return smk_curacc(msp, rc); + may = smack_flags_to_may(msqflg); + return smk_curacc_msq(msq, may); } /** @@ -1961,9 +2139,7 @@ static int smack_msg_queue_msgsnd(struct msg_queue *msq, struct msg_msg *msg, static int smack_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg, struct task_struct *target, long type, int mode) { - char *msp = smack_of_msq(msq); - - return smk_curacc(msp, MAY_READWRITE); + return smk_curacc_msq(msq, MAY_READWRITE); } /** @@ -1976,10 +2152,14 @@ static int smack_msg_queue_msgrcv(struct msg_queue *msq, struct msg_msg *msg, static int smack_ipc_permission(struct kern_ipc_perm *ipp, short flag) { char *isp = ipp->security; - int may; + int may = smack_flags_to_may(flag); + struct smk_audit_info ad; - may = smack_flags_to_may(flag); - return smk_curacc(isp, may); +#ifdef CONFIG_AUDIT + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_IPC); + ad.a.u.ipc_id = ipp->id; +#endif + return smk_curacc(isp, may, &ad); } /** @@ -2238,8 +2418,12 @@ static int smack_unix_stream_connect(struct socket *sock, { struct inode *sp = SOCK_INODE(sock); struct inode *op = SOCK_INODE(other); + struct smk_audit_info ad; - return smk_access(smk_of_inode(sp), smk_of_inode(op), MAY_READWRITE); + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_NET); + smk_ad_setfield_u_net_sk(&ad, other->sk); + return smk_access(smk_of_inode(sp), smk_of_inode(op), + MAY_READWRITE, &ad); } /** @@ -2254,8 +2438,11 @@ static int smack_unix_may_send(struct socket *sock, struct socket *other) { struct inode *sp = SOCK_INODE(sock); struct inode *op = SOCK_INODE(other); + struct smk_audit_info ad; - return smk_access(smk_of_inode(sp), smk_of_inode(op), MAY_WRITE); + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_NET); + smk_ad_setfield_u_net_sk(&ad, other->sk); + return smk_access(smk_of_inode(sp), smk_of_inode(op), MAY_WRITE, &ad); } /** @@ -2370,7 +2557,7 @@ static int smack_socket_sock_rcv_skb(struct sock *sk, struct sk_buff *skb) char smack[SMK_LABELLEN]; char *csp; int rc; - + struct smk_audit_info ad; if (sk->sk_family != PF_INET && sk->sk_family != PF_INET6) return 0; @@ -2388,13 +2575,19 @@ static int smack_socket_sock_rcv_skb(struct sock *sk, struct sk_buff *skb) netlbl_secattr_destroy(&secattr); +#ifdef CONFIG_AUDIT + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_NET); + ad.a.u.net.family = sk->sk_family; + ad.a.u.net.netif = skb->iif; + ipv4_skb_to_auditdata(skb, &ad.a, NULL); +#endif /* * Receiving a packet requires that the other end * be able to write here. Read access is not required. * This is the simplist possible security model * for networking. */ - rc = smk_access(csp, ssp->smk_in, MAY_WRITE); + rc = smk_access(csp, ssp->smk_in, MAY_WRITE, &ad); if (rc != 0) netlbl_skbuff_err(skb, rc, 0); return rc; @@ -2523,6 +2716,7 @@ static int smack_inet_conn_request(struct sock *sk, struct sk_buff *skb, struct iphdr *hdr; char smack[SMK_LABELLEN]; int rc; + struct smk_audit_info ad; /* handle mapped IPv4 packets arriving via IPv6 sockets */ if (family == PF_INET6 && skb->protocol == htons(ETH_P_IP)) @@ -2536,11 +2730,17 @@ static int smack_inet_conn_request(struct sock *sk, struct sk_buff *skb, strncpy(smack, smack_known_huh.smk_known, SMK_MAXLEN); netlbl_secattr_destroy(&secattr); +#ifdef CONFIG_AUDIT + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_NET); + ad.a.u.net.family = family; + ad.a.u.net.netif = skb->iif; + ipv4_skb_to_auditdata(skb, &ad.a, NULL); +#endif /* * Receiving a packet requires that the other end be able to write * here. Read access is not required. */ - rc = smk_access(smack, ssp->smk_in, MAY_WRITE); + rc = smk_access(smack, ssp->smk_in, MAY_WRITE, &ad); if (rc != 0) return rc; @@ -2642,6 +2842,7 @@ static int smack_key_permission(key_ref_t key_ref, const struct cred *cred, key_perm_t perm) { struct key *keyp; + struct smk_audit_info ad; keyp = key_ref_to_ptr(key_ref); if (keyp == NULL) @@ -2657,8 +2858,13 @@ static int smack_key_permission(key_ref_t key_ref, */ if (cred->security == NULL) return -EACCES; - - return smk_access(cred->security, keyp->security, MAY_READWRITE); +#ifdef CONFIG_AUDIT + smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_KEY); + ad.a.u.key_struct.key = keyp->serial; + ad.a.u.key_struct.key_desc = keyp->description; +#endif + return smk_access(cred->security, keyp->security, + MAY_READWRITE, &ad); } #endif /* CONFIG_KEYS */ diff --git a/security/smack/smackfs.c b/security/smack/smackfs.c index e03a7e19c73b..904af3483286 100644 --- a/security/smack/smackfs.c +++ b/security/smack/smackfs.c @@ -41,6 +41,7 @@ enum smk_inos { SMK_AMBIENT = 7, /* internet ambient label */ SMK_NETLBLADDR = 8, /* single label hosts */ SMK_ONLYCAP = 9, /* the only "capable" label */ + SMK_LOGGING = 10, /* logging */ }; /* @@ -1191,6 +1192,69 @@ static const struct file_operations smk_onlycap_ops = { .write = smk_write_onlycap, }; +/** + * smk_read_logging - read() for /smack/logging + * @filp: file pointer, not actually used + * @buf: where to put the result + * @cn: maximum to send along + * @ppos: where to start + * + * Returns number of bytes read or error code, as appropriate + */ +static ssize_t smk_read_logging(struct file *filp, char __user *buf, + size_t count, loff_t *ppos) +{ + char temp[32]; + ssize_t rc; + + if (*ppos != 0) + return 0; + + sprintf(temp, "%d\n", log_policy); + rc = simple_read_from_buffer(buf, count, ppos, temp, strlen(temp)); + return rc; +} + +/** + * smk_write_logging - write() for /smack/logging + * @file: file pointer, not actually used + * @buf: where to get the data from + * @count: bytes sent + * @ppos: where to start + * + * Returns number of bytes written or error code, as appropriate + */ +static ssize_t smk_write_logging(struct file *file, const char __user *buf, + size_t count, loff_t *ppos) +{ + char temp[32]; + int i; + + if (!capable(CAP_MAC_ADMIN)) + return -EPERM; + + if (count >= sizeof(temp) || count == 0) + return -EINVAL; + + if (copy_from_user(temp, buf, count) != 0) + return -EFAULT; + + temp[count] = '\0'; + + if (sscanf(temp, "%d", &i) != 1) + return -EINVAL; + if (i < 0 || i > 3) + return -EINVAL; + log_policy = i; + return count; +} + + + +static const struct file_operations smk_logging_ops = { + .read = smk_read_logging, + .write = smk_write_logging, +}; /** * smk_fill_super - fill the /smackfs superblock * @sb: the empty superblock @@ -1221,6 +1285,8 @@ static int smk_fill_super(struct super_block *sb, void *data, int silent) {"netlabel", &smk_netlbladdr_ops, S_IRUGO|S_IWUSR}, [SMK_ONLYCAP] = {"onlycap", &smk_onlycap_ops, S_IRUGO|S_IWUSR}, + [SMK_LOGGING] = + {"logging", &smk_logging_ops, S_IRUGO|S_IWUSR}, /* last one */ {""} }; -- cgit v1.2.3 From 6fd9b3a40b82081d9e6490b0d7cd656e9a78a134 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 13 Apr 2009 21:31:18 -0700 Subject: rcu: Update RCU tracing documentation for __rcu_pending This patch updates the RCU documentation to reflect the changes in tracing made in the previous patch in the set. Located-by: Anton Blanchard Tested-by: Anton Blanchard Signed-off-by: Paul E. McKenney Cc: anton@samba.org Cc: akpm@linux-foundation.org Cc: dipankar@in.ibm.com Cc: manfred@colorfullife.com Cc: cl@linux-foundation.org Cc: josht@linux.vnet.ibm.com Cc: schamp@sgi.com Cc: niv@us.ibm.com Cc: dvhltc@us.ibm.com Cc: ego@in.ibm.com Cc: laijs@cn.fujitsu.com Cc: rostedt@goodmis.org Cc: peterz@infradead.org Cc: penberg@cs.helsinki.fi Cc: andi@firstfloor.org Cc: "Paul E. McKenney" LKML-Reference: <12396834792865-git-send-email-> Signed-off-by: Ingo Molnar --- Documentation/RCU/trace.txt | 102 ++++++++++++++++++++++++++++++++++---------- 1 file changed, 80 insertions(+), 22 deletions(-) (limited to 'Documentation') diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt index 068848240a8b..02cced183b2d 100644 --- a/Documentation/RCU/trace.txt +++ b/Documentation/RCU/trace.txt @@ -192,23 +192,24 @@ rcu/rcuhier (which displays the struct rcu_node hierarchy). The output of "cat rcu/rcudata" looks as follows: rcu: - 0 c=4011 g=4012 pq=1 pqc=4011 qp=0 rpfq=1 rp=3c2a dt=23301/73 dn=2 df=1882 of=0 ri=2126 ql=2 b=10 - 1 c=4011 g=4012 pq=1 pqc=4011 qp=0 rpfq=3 rp=39a6 dt=78073/1 dn=2 df=1402 of=0 ri=1875 ql=46 b=10 - 2 c=4010 g=4010 pq=1 pqc=4010 qp=0 rpfq=-5 rp=1d12 dt=16646/0 dn=2 df=3140 of=0 ri=2080 ql=0 b=10 - 3 c=4012 g=4013 pq=1 pqc=4012 qp=1 rpfq=3 rp=2b50 dt=21159/1 dn=2 df=2230 of=0 ri=1923 ql=72 b=10 - 4 c=4012 g=4013 pq=1 pqc=4012 qp=1 rpfq=3 rp=1644 dt=5783/1 dn=2 df=3348 of=0 ri=2805 ql=7 b=10 - 5 c=4012 g=4013 pq=0 pqc=4011 qp=1 rpfq=3 rp=1aac dt=5879/1 dn=2 df=3140 of=0 ri=2066 ql=10 b=10 - 6 c=4012 g=4013 pq=1 pqc=4012 qp=1 rpfq=3 rp=ed8 dt=5847/1 dn=2 df=3797 of=0 ri=1266 ql=10 b=10 - 7 c=4012 g=4013 pq=1 pqc=4012 qp=1 rpfq=3 rp=1fa2 dt=6199/1 dn=2 df=2795 of=0 ri=2162 ql=28 b=10 +rcu: + 0 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=10951/1 dn=0 df=1101 of=0 ri=36 ql=0 b=10 + 1 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=16117/1 dn=0 df=1015 of=0 ri=0 ql=0 b=10 + 2 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=1445/1 dn=0 df=1839 of=0 ri=0 ql=0 b=10 + 3 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=6681/1 dn=0 df=1545 of=0 ri=0 ql=0 b=10 + 4 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=1003/1 dn=0 df=1992 of=0 ri=0 ql=0 b=10 + 5 c=17829 g=17830 pq=1 pqc=17829 qp=1 dt=3887/1 dn=0 df=3331 of=0 ri=4 ql=2 b=10 + 6 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=859/1 dn=0 df=3224 of=0 ri=0 ql=0 b=10 + 7 c=17829 g=17830 pq=0 pqc=17829 qp=1 dt=3761/1 dn=0 df=1818 of=0 ri=0 ql=2 b=10 rcu_bh: - 0 c=-268 g=-268 pq=1 pqc=-268 qp=0 rpfq=-145 rp=21d6 dt=23301/73 dn=2 df=0 of=0 ri=0 ql=0 b=10 - 1 c=-268 g=-268 pq=1 pqc=-268 qp=1 rpfq=-170 rp=20ce dt=78073/1 dn=2 df=26 of=0 ri=5 ql=0 b=10 - 2 c=-268 g=-268 pq=1 pqc=-268 qp=1 rpfq=-83 rp=fbd dt=16646/0 dn=2 df=28 of=0 ri=4 ql=0 b=10 - 3 c=-268 g=-268 pq=1 pqc=-268 qp=0 rpfq=-105 rp=178c dt=21159/1 dn=2 df=28 of=0 ri=2 ql=0 b=10 - 4 c=-268 g=-268 pq=1 pqc=-268 qp=1 rpfq=-30 rp=b54 dt=5783/1 dn=2 df=32 of=0 ri=0 ql=0 b=10 - 5 c=-268 g=-268 pq=1 pqc=-268 qp=1 rpfq=-29 rp=df5 dt=5879/1 dn=2 df=30 of=0 ri=3 ql=0 b=10 - 6 c=-268 g=-268 pq=1 pqc=-268 qp=1 rpfq=-28 rp=788 dt=5847/1 dn=2 df=32 of=0 ri=0 ql=0 b=10 - 7 c=-268 g=-268 pq=1 pqc=-268 qp=1 rpfq=-53 rp=1098 dt=6199/1 dn=2 df=30 of=0 ri=3 ql=0 b=10 + 0 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=10951/1 dn=0 df=0 of=0 ri=0 ql=0 b=10 + 1 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=16117/1 dn=0 df=13 of=0 ri=0 ql=0 b=10 + 2 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=1445/1 dn=0 df=15 of=0 ri=0 ql=0 b=10 + 3 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=6681/1 dn=0 df=9 of=0 ri=0 ql=0 b=10 + 4 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=1003/1 dn=0 df=15 of=0 ri=0 ql=0 b=10 + 5 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=3887/1 dn=0 df=15 of=0 ri=0 ql=0 b=10 + 6 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=859/1 dn=0 df=15 of=0 ri=0 ql=0 b=10 + 7 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=3761/1 dn=0 df=15 of=0 ri=0 ql=0 b=10 The first section lists the rcu_data structures for rcu, the second for rcu_bh. Each section has one line per CPU, or eight for this 8-CPU system. @@ -253,12 +254,6 @@ o "pqc" indicates which grace period the last-observed quiescent o "qp" indicates that RCU still expects a quiescent state from this CPU. -o "rpfq" is the number of rcu_pending() calls on this CPU required - to induce this CPU to invoke force_quiescent_state(). - -o "rp" is low-order four hex digits of the count of how many times - rcu_pending() has been invoked on this CPU. - o "dt" is the current value of the dyntick counter that is incremented when entering or leaving dynticks idle state, either by the scheduler or by irq. The number after the "/" is the interrupt @@ -305,6 +300,9 @@ o "b" is the batch limit for this CPU. If more than this number of RCU callbacks is ready to invoke, then the remainder will be deferred. +There is also an rcu/rcudata.csv file with the same information in +comma-separated-variable spreadsheet format. + The output of "cat rcu/rcugp" looks as follows: @@ -411,3 +409,63 @@ o Each element of the form "1/1 0:127 ^0" represents one struct For example, the first entry at the lowest level shows "^0", indicating that it corresponds to bit zero in the first entry at the middle level. + + +The output of "cat rcu/rcu_pending" looks as follows: + +rcu: + 0 np=255892 qsp=53936 cbr=0 cng=14417 gpc=10033 gps=24320 nf=6445 nn=146741 + 1 np=261224 qsp=54638 cbr=0 cng=25723 gpc=16310 gps=2849 nf=5912 nn=155792 + 2 np=237496 qsp=49664 cbr=0 cng=2762 gpc=45478 gps=1762 nf=1201 nn=136629 + 3 np=236249 qsp=48766 cbr=0 cng=286 gpc=48049 gps=1218 nf=207 nn=137723 + 4 np=221310 qsp=46850 cbr=0 cng=26 gpc=43161 gps=4634 nf=3529 nn=123110 + 5 np=237332 qsp=48449 cbr=0 cng=54 gpc=47920 gps=3252 nf=201 nn=137456 + 6 np=219995 qsp=46718 cbr=0 cng=50 gpc=42098 gps=6093 nf=4202 nn=120834 + 7 np=249893 qsp=49390 cbr=0 cng=72 gpc=38400 gps=17102 nf=41 nn=144888 +rcu_bh: + 0 np=146741 qsp=1419 cbr=0 cng=6 gpc=0 gps=0 nf=2 nn=145314 + 1 np=155792 qsp=12597 cbr=0 cng=0 gpc=4 gps=8 nf=3 nn=143180 + 2 np=136629 qsp=18680 cbr=0 cng=0 gpc=7 gps=6 nf=0 nn=117936 + 3 np=137723 qsp=2843 cbr=0 cng=0 gpc=10 gps=7 nf=0 nn=134863 + 4 np=123110 qsp=12433 cbr=0 cng=0 gpc=4 gps=2 nf=0 nn=110671 + 5 np=137456 qsp=4210 cbr=0 cng=0 gpc=6 gps=5 nf=0 nn=133235 + 6 np=120834 qsp=9902 cbr=0 cng=0 gpc=6 gps=3 nf=2 nn=110921 + 7 np=144888 qsp=26336 cbr=0 cng=0 gpc=8 gps=2 nf=0 nn=118542 + +As always, this is once again split into "rcu" and "rcu_bh" portions. +The fields are as follows: + +o "np" is the number of times that __rcu_pending() has been invoked + for the corresponding flavor of RCU. + +o "qsp" is the number of times that the RCU was waiting for a + quiescent state from this CPU. + +o "cbr" is the number of times that this CPU had RCU callbacks + that had passed through a grace period, and were thus ready + to be invoked. + +o "cng" is the number of times that this CPU needed another + grace period while RCU was idle. + +o "gpc" is the number of times that an old grace period had + completed, but this CPU was not yet aware of it. + +o "gps" is the number of times that a new grace period had started, + but this CPU was not yet aware of it. + +o "nf" is the number of times that this CPU suspected that the + current grace period had run for too long, and thus needed to + be forced. + + Please note that "forcing" consists of sending resched IPIs + to holdout CPUs. If that CPU really still is in an old RCU + read-side critical section, then we really do have to wait for it. + The assumption behing "forcing" is that the CPU is not still in + an old RCU read-side critical section, but has not yet responded + for some other reason. + +o "nn" is the number of times that this CPU needed nothing. Alert + readers will note that the rcu "nn" number for a given CPU very + closely matches the rcu_bh "np" number for that same CPU. This + is due to short-circuit evaluation in rcu_pending(). -- cgit v1.2.3 From 50fa610a3b6ba7cf91d7a92229177dfaff2b81a1 Mon Sep 17 00:00:00 2001 From: David Howells Date: Tue, 28 Apr 2009 15:01:38 +0100 Subject: sched: Document memory barriers implied by sleep/wake-up primitives Add a section to the memory barriers document to note the implied memory barriers of sleep primitives (set_current_state() and wrappers) and wake-up primitives (wake_up() and co.). Also extend the in-code comments on the wake_up() functions to note these implied barriers. [ Impact: add documentation ] Signed-off-by: David Howells Cc: Oleg Nesterov Cc: Linus Torvalds Cc: Andrew Morton LKML-Reference: <20090428140138.1192.94723.stgit@warthog.procyon.org.uk> Signed-off-by: Ingo Molnar --- Documentation/memory-barriers.txt | 129 +++++++++++++++++++++++++++++++++++++- kernel/sched.c | 23 +++++++ 2 files changed, 151 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index f5b7127f54ac..7f5809eddee6 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -31,6 +31,7 @@ Contents: - Locking functions. - Interrupt disabling functions. + - Sleep and wake-up functions. - Miscellaneous functions. (*) Inter-CPU locking barrier effects. @@ -1217,6 +1218,132 @@ barriers are required in such a situation, they must be provided from some other means. +SLEEP AND WAKE-UP FUNCTIONS +--------------------------- + +Sleeping and waking on an event flagged in global data can be viewed as an +interaction between two pieces of data: the task state of the task waiting for +the event and the global data used to indicate the event. To make sure that +these appear to happen in the right order, the primitives to begin the process +of going to sleep, and the primitives to initiate a wake up imply certain +barriers. + +Firstly, the sleeper normally follows something like this sequence of events: + + for (;;) { + set_current_state(TASK_UNINTERRUPTIBLE); + if (event_indicated) + break; + schedule(); + } + +A general memory barrier is interpolated automatically by set_current_state() +after it has altered the task state: + + CPU 1 + =============================== + set_current_state(); + set_mb(); + STORE current->state + + LOAD event_indicated + +set_current_state() may be wrapped by: + + prepare_to_wait(); + prepare_to_wait_exclusive(); + +which therefore also imply a general memory barrier after setting the state. +The whole sequence above is available in various canned forms, all of which +interpolate the memory barrier in the right place: + + wait_event(); + wait_event_interruptible(); + wait_event_interruptible_exclusive(); + wait_event_interruptible_timeout(); + wait_event_killable(); + wait_event_timeout(); + wait_on_bit(); + wait_on_bit_lock(); + + +Secondly, code that performs a wake up normally follows something like this: + + event_indicated = 1; + wake_up(&event_wait_queue); + +or: + + event_indicated = 1; + wake_up_process(event_daemon); + +A write memory barrier is implied by wake_up() and co. if and only if they wake +something up. The barrier occurs before the task state is cleared, and so sits +between the STORE to indicate the event and the STORE to set TASK_RUNNING: + + CPU 1 CPU 2 + =============================== =============================== + set_current_state(); STORE event_indicated + set_mb(); wake_up(); + STORE current->state + STORE current->state + LOAD event_indicated + +The available waker functions include: + + complete(); + wake_up(); + wake_up_all(); + wake_up_bit(); + wake_up_interruptible(); + wake_up_interruptible_all(); + wake_up_interruptible_nr(); + wake_up_interruptible_poll(); + wake_up_interruptible_sync(); + wake_up_interruptible_sync_poll(); + wake_up_locked(); + wake_up_locked_poll(); + wake_up_nr(); + wake_up_poll(); + wake_up_process(); + + +[!] Note that the memory barriers implied by the sleeper and the waker do _not_ +order multiple stores before the wake-up with respect to loads of those stored +values after the sleeper has called set_current_state(). For instance, if the +sleeper does: + + set_current_state(TASK_INTERRUPTIBLE); + if (event_indicated) + break; + __set_current_state(TASK_RUNNING); + do_something(my_data); + +and the waker does: + + my_data = value; + event_indicated = 1; + wake_up(&event_wait_queue); + +there's no guarantee that the change to event_indicated will be perceived by +the sleeper as coming after the change to my_data. In such a circumstance, the +code on both sides must interpolate its own memory barriers between the +separate data accesses. Thus the above sleeper ought to do: + + set_current_state(TASK_INTERRUPTIBLE); + if (event_indicated) { + smp_rmb(); + do_something(my_data); + } + +and the waker should do: + + my_data = value; + smp_wmb(); + event_indicated = 1; + wake_up(&event_wait_queue); + + MISCELLANEOUS FUNCTIONS ----------------------- @@ -1366,7 +1493,7 @@ WHERE ARE MEMORY BARRIERS NEEDED? Under normal operation, memory operation reordering is generally not going to be a problem as a single-threaded linear piece of code will still appear to -work correctly, even if it's in an SMP kernel. There are, however, three +work correctly, even if it's in an SMP kernel. There are, however, four circumstances in which reordering definitely _could_ be a problem: (*) Interprocessor interaction. diff --git a/kernel/sched.c b/kernel/sched.c index b902e587a3a0..fd0c2cee3f35 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -2458,6 +2458,17 @@ out: return success; } +/** + * wake_up_process - Wake up a specific process + * @p: The process to be woken up. + * + * Attempt to wake up the nominated process and move it to the set of runnable + * processes. Returns 1 if the process was woken up, 0 if it was already + * running. + * + * It may be assumed that this function implies a write memory barrier before + * changing the task state if and only if any tasks are woken up. + */ int wake_up_process(struct task_struct *p) { return try_to_wake_up(p, TASK_ALL, 0); @@ -5241,6 +5252,9 @@ void __wake_up_common(wait_queue_head_t *q, unsigned int mode, * @mode: which threads * @nr_exclusive: how many wake-one or wake-many threads to wake up * @key: is directly passed to the wakeup function + * + * It may be assumed that this function implies a write memory barrier before + * changing the task state if and only if any tasks are woken up. */ void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, void *key) @@ -5279,6 +5293,9 @@ void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key) * with each other. This can prevent needless bouncing between CPUs. * * On UP it can prevent extra preemption. + * + * It may be assumed that this function implies a write memory barrier before + * changing the task state if and only if any tasks are woken up. */ void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, void *key) @@ -5315,6 +5332,9 @@ EXPORT_SYMBOL_GPL(__wake_up_sync); /* For internal use only */ * awakened in the same order in which they were queued. * * See also complete_all(), wait_for_completion() and related routines. + * + * It may be assumed that this function implies a write memory barrier before + * changing the task state if and only if any tasks are woken up. */ void complete(struct completion *x) { @@ -5332,6 +5352,9 @@ EXPORT_SYMBOL(complete); * @x: holds the state of this particular completion * * This will wake up all threads waiting on this particular completion event. + * + * It may be assumed that this function implies a write memory barrier before + * changing the task state if and only if any tasks are woken up. */ void complete_all(struct completion *x) { -- cgit v1.2.3 From a76f8c6da1e48fd4ef025f42c736389532ff30ba Mon Sep 17 00:00:00 2001 From: Jason Baron Date: Thu, 30 Apr 2009 13:29:42 -0400 Subject: tracing: add new tracepoints docbook Add tracepoint docbook. This will help us document and understand what tracepoints are in the kernel. Since there are multiple macros, and files that contain tracepoints. [ Impact: add documentation ] Signed-off-by: Jason Baron Acked-by: Randy Dunlap Cc: akpm@linux-foundation.org Cc: rostedt@goodmis.org Cc: fweisbec@gmail.com Cc: mathieu.desnoyers@polymtl.ca Cc: wcohen@redhat.com LKML-Reference: <84160b6bd94aff02455da7e12bad054d34c579a0.1241107197.git.jbaron@redhat.com> Signed-off-by: Ingo Molnar --- Documentation/DocBook/Makefile | 3 +- Documentation/DocBook/tracepoint.tmpl | 84 +++++++++++++++++++++++++++++++++++ 2 files changed, 86 insertions(+), 1 deletion(-) create mode 100644 Documentation/DocBook/tracepoint.tmpl (limited to 'Documentation') diff --git a/Documentation/DocBook/Makefile b/Documentation/DocBook/Makefile index 8918a32c6b3a..4c8f4d6e114a 100644 --- a/Documentation/DocBook/Makefile +++ b/Documentation/DocBook/Makefile @@ -13,7 +13,8 @@ DOCBOOKS := z8530book.xml mcabook.xml device-drivers.xml \ gadget.xml libata.xml mtdnand.xml librs.xml rapidio.xml \ genericirq.xml s390-drivers.xml uio-howto.xml scsi.xml \ mac80211.xml debugobjects.xml sh.xml regulator.xml \ - alsa-driver-api.xml writing-an-alsa-driver.xml + alsa-driver-api.xml writing-an-alsa-driver.xml \ + tracepoint.xml ### # The build process is as follows (targets): diff --git a/Documentation/DocBook/tracepoint.tmpl b/Documentation/DocBook/tracepoint.tmpl new file mode 100644 index 000000000000..70891bc68491 --- /dev/null +++ b/Documentation/DocBook/tracepoint.tmpl @@ -0,0 +1,84 @@ + + + + + + The Linux Kernel Tracepoint API + + + + Jason + Baron + +
+ jbaron@redhat.com +
+
+
+
+ + + + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; either + version 2 of the License, or (at your option) any later + version. + + + + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + + + + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + + + + For more details see the file COPYING in the source + distribution of Linux. + + +
+ + + + Introduction + + Tracepoints are static probe points that are located in strategic points + throughout the kernel. 'Probes' register/unregister with tracepoints + via a callback mechanism. The 'probes' are strictly typed functions that + are passed a unique set of parameters defined by each tracepoint. + + + + From this simple callback mechanism, 'probes' can be used to profile, debug, + and understand kernel behavior. There are a number of tools that provide a + framework for using 'probes'. These tools include Systemtap, ftrace, and + LTTng. + + + + Tracepoints are defined in a number of header files via various macros. Thus, + the purpose of this document is to provide a clear accounting of the available + tracepoints. The intention is to understand not only what tracepoints are + available but also to understand where future tracepoints might be added. + + + + The API presented has functions of the form: + trace_tracepointname(function parameters). These are the + tracepoints callbacks that are found throughout the code. Registering and + unregistering probes with these callback sites is covered in the + Documentation/trace/* directory. + + + +
-- cgit v1.2.3 From 9ee1983c9aa18f12388ef660d0c76a23dc112959 Mon Sep 17 00:00:00 2001 From: Jason Baron Date: Thu, 30 Apr 2009 13:29:47 -0400 Subject: tracing: add irq tracepoint documentation Document irqs for the newly created docbook. [ Impact: add documentation ] Signed-off-by: Jason Baron Acked-by: Randy Dunlap Cc: akpm@linux-foundation.org Cc: rostedt@goodmis.org Cc: fweisbec@gmail.com Cc: mathieu.desnoyers@polymtl.ca Cc: wcohen@redhat.com LKML-Reference: <73ff42be3420157667ec548e9b0e409c3cfad05f.1241107197.git.jbaron@redhat.com> Signed-off-by: Ingo Molnar --- Documentation/DocBook/tracepoint.tmpl | 5 ++++ include/trace/events/irq.h | 46 ++++++++++++++++++++++++++++++++--- 2 files changed, 47 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/DocBook/tracepoint.tmpl b/Documentation/DocBook/tracepoint.tmpl index 70891bc68491..b0756d0fd579 100644 --- a/Documentation/DocBook/tracepoint.tmpl +++ b/Documentation/DocBook/tracepoint.tmpl @@ -81,4 +81,9 @@ + + IRQ +!Iinclude/trace/events/irq.h + + diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h index 768686467518..32a9f7ef432b 100644 --- a/include/trace/events/irq.h +++ b/include/trace/events/irq.h @@ -7,8 +7,16 @@ #undef TRACE_SYSTEM #define TRACE_SYSTEM irq -/* - * Tracepoint for entry of interrupt handler: +/** + * irq_handler_entry - called immediately before the irq action handler + * @irq: irq number + * @action: pointer to struct irqaction + * + * The struct irqaction pointed to by @action contains various + * information about the handler, including the device name, + * @action->name, and the device id, @action->dev_id. When used in + * conjunction with the irq_handler_exit tracepoint, we can figure + * out irq handler latencies. */ TRACE_EVENT(irq_handler_entry, @@ -29,8 +37,16 @@ TRACE_EVENT(irq_handler_entry, TP_printk("irq=%d handler=%s", __entry->irq, __get_str(name)) ); -/* - * Tracepoint for return of an interrupt handler: +/** + * irq_handler_exit - called immediately after the irq action handler returns + * @irq: irq number + * @action: pointer to struct irqaction + * @ret: return value + * + * If the @ret value is set to IRQ_HANDLED, then we know that the corresponding + * @action->handler scuccessully handled this irq. Otherwise, the irq might be + * a shared irq line, or the irq was not handled successfully. Can be used in + * conjunction with the irq_handler_entry to understand irq handler latencies. */ TRACE_EVENT(irq_handler_exit, @@ -52,6 +68,17 @@ TRACE_EVENT(irq_handler_exit, __entry->irq, __entry->ret ? "handled" : "unhandled") ); +/** + * softirq_entry - called immediately before the softirq handler + * @h: pointer to struct softirq_action + * @vec: pointer to first struct softirq_action in softirq_vec array + * + * The @h parameter, contains a pointer to the struct softirq_action + * which has a pointer to the action handler that is called. By subtracting + * the @vec pointer from the @h pointer, we can determine the softirq + * number. Also, when used in combination with the softirq_exit tracepoint + * we can determine the softirq latency. + */ TRACE_EVENT(softirq_entry, TP_PROTO(struct softirq_action *h, struct softirq_action *vec), @@ -71,6 +98,17 @@ TRACE_EVENT(softirq_entry, TP_printk("softirq=%d action=%s", __entry->vec, __get_str(name)) ); +/** + * softirq_exit - called immediately after the softirq handler returns + * @h: pointer to struct softirq_action + * @vec: pointer to first struct softirq_action in softirq_vec array + * + * The @h parameter contains a pointer to the struct softirq_action + * that has handled the softirq. By subtracting the @vec pointer from + * the @h pointer, we can determine the softirq number. Also, when used in + * combination with the softirq_entry tracepoint we can determine the softirq + * latency. + */ TRACE_EVENT(softirq_exit, TP_PROTO(struct softirq_action *h, struct softirq_action *vec), -- cgit v1.2.3 From 60aa605dfce2976e54fa76e805ab0f221372d4d9 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 5 May 2009 17:50:21 +0200 Subject: sched: rt: document the risk of small values in the bandwidth settings Thomas noted that we should disallow sysctl_sched_rt_runtime == 0 for (!RT_GROUP) since the root group always has some RT tasks in it. Further, update the documentation to inspire clue. [ Impact: exclude corner-case sysctl_sched_rt_runtime value ] Reported-by: Thomas Gleixner Signed-off-by: Peter Zijlstra LKML-Reference: <20090505155436.863098054@chello.nl> Signed-off-by: Ingo Molnar --- Documentation/scheduler/sched-rt-group.txt | 18 ++++++++++++++++++ kernel/sched.c | 7 +++++++ 2 files changed, 25 insertions(+) (limited to 'Documentation') diff --git a/Documentation/scheduler/sched-rt-group.txt b/Documentation/scheduler/sched-rt-group.txt index 5ba4d3fc625a..eb74b014a3f8 100644 --- a/Documentation/scheduler/sched-rt-group.txt +++ b/Documentation/scheduler/sched-rt-group.txt @@ -4,6 +4,7 @@ CONTENTS ======== +0. WARNING 1. Overview 1.1 The problem 1.2 The solution @@ -14,6 +15,23 @@ CONTENTS 3. Future plans +0. WARNING +========== + + Fiddling with these settings can result in an unstable system, the knobs are + root only and assumes root knows what he is doing. + +Most notable: + + * very small values in sched_rt_period_us can result in an unstable + system when the period is smaller than either the available hrtimer + resolution, or the time it takes to handle the budget refresh itself. + + * very small values in sched_rt_runtime_us can result in an unstable + system when the runtime is so small the system has difficulty making + forward progress (NOTE: the migration thread and kstopmachine both + are real-time processes). + 1. Overview =========== diff --git a/kernel/sched.c b/kernel/sched.c index 54d67b94f1a9..2a43a581ead3 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -9917,6 +9917,13 @@ static int sched_rt_global_constraints(void) if (sysctl_sched_rt_period <= 0) return -EINVAL; + /* + * There's always some RT tasks in the root group + * -- migration, kstopmachine etc.. + */ + if (sysctl_sched_rt_runtime == 0) + return -EBUSY; + spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags); for_each_possible_cpu(i) { struct rt_rq *rt_rq = &cpu_rq(i)->rt; -- cgit v1.2.3 From c898faf91b3ec6b0f6efa35831b3984fa3331db0 Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Tue, 5 May 2009 17:28:56 -0400 Subject: x86: 46 bit physical address support on 64 bits Extend the maximum addressable memory on x86-64 from 2^44 to 2^46 bytes. This requires some shuffling around of the vmalloc and virtual memmap memory areas, to keep them away from the direct mapping of up to 64TB of physical memory. This patch also introduces a guard hole between the vmalloc area and the virtual memory map space. There's really no good reason why we wouldn't have a guard hole there. [ Impact: future hardware enablement ] Signed-off-by: Rik van Riel LKML-Reference: <20090505172856.6820db22@cuia.bos.redhat.com> Signed-off-by: H. Peter Anvin --- Documentation/x86/x86_64/mm.txt | 9 +++++---- arch/x86/include/asm/page_64_types.h | 2 +- arch/x86/include/asm/pgtable_64_types.h | 8 ++++---- arch/x86/include/asm/sparsemem.h | 2 +- 4 files changed, 11 insertions(+), 10 deletions(-) (limited to 'Documentation') diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt index 29b52b14d0b4..539413235845 100644 --- a/Documentation/x86/x86_64/mm.txt +++ b/Documentation/x86/x86_64/mm.txt @@ -6,10 +6,11 @@ Virtual memory map with 4 level page tables: 0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm hole caused by [48:63] sign extension ffff800000000000 - ffff80ffffffffff (=40 bits) guard hole -ffff880000000000 - ffffc0ffffffffff (=57 TB) direct mapping of all phys. memory -ffffc10000000000 - ffffc1ffffffffff (=40 bits) hole -ffffc20000000000 - ffffe1ffffffffff (=45 bits) vmalloc/ioremap space -ffffe20000000000 - ffffe2ffffffffff (=40 bits) virtual memory map (1TB) +ffff880000000000 - ffffc8ffffffffff (=64 TB) direct mapping of all phys. memory +ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole +ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space +ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole +ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB) ... unused hole ... ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0 ffffffffa0000000 - fffffffffff00000 (=1536 MB) module mapping space diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h index 3f587188ae68..6fadb020bd2b 100644 --- a/arch/x86/include/asm/page_64_types.h +++ b/arch/x86/include/asm/page_64_types.h @@ -47,7 +47,7 @@ #define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START) #define __START_KERNEL_map _AC(0xffffffff80000000, UL) -/* See Documentation/x86_64/mm.txt for a description of the memory map. */ +/* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */ #define __PHYSICAL_MASK_SHIFT 46 #define __VIRTUAL_MASK_SHIFT 48 diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h index fbf42b8e0383..766ea16fbbbd 100644 --- a/arch/x86/include/asm/pgtable_64_types.h +++ b/arch/x86/include/asm/pgtable_64_types.h @@ -51,11 +51,11 @@ typedef struct { pteval_t pte; } pte_t; #define PGDIR_SIZE (_AC(1, UL) << PGDIR_SHIFT) #define PGDIR_MASK (~(PGDIR_SIZE - 1)) - +/* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */ #define MAXMEM _AC(__AC(1, UL) << MAX_PHYSMEM_BITS, UL) -#define VMALLOC_START _AC(0xffffc20000000000, UL) -#define VMALLOC_END _AC(0xffffe1ffffffffff, UL) -#define VMEMMAP_START _AC(0xffffe20000000000, UL) +#define VMALLOC_START _AC(0xffffc90000000000, UL) +#define VMALLOC_END _AC(0xffffe8ffffffffff, UL) +#define VMEMMAP_START _AC(0xffffea0000000000, UL) #define MODULES_VADDR _AC(0xffffffffa0000000, UL) #define MODULES_END _AC(0xffffffffff000000, UL) #define MODULES_LEN (MODULES_END - MODULES_VADDR) diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h index e3cc3c063ec5..4517d6b93188 100644 --- a/arch/x86/include/asm/sparsemem.h +++ b/arch/x86/include/asm/sparsemem.h @@ -27,7 +27,7 @@ #else /* CONFIG_X86_32 */ # define SECTION_SIZE_BITS 27 /* matt - 128 is convenient right now */ # define MAX_PHYSADDR_BITS 44 -# define MAX_PHYSMEM_BITS 44 /* Can be max 45 bits */ +# define MAX_PHYSMEM_BITS 46 #endif #endif /* CONFIG_SPARSEMEM */ -- cgit v1.2.3 From 2feceeff1e771850e49f9074307f071964fd9e3e Mon Sep 17 00:00:00 2001 From: "H. Peter Anvin" Date: Tue, 5 May 2009 19:07:07 -0700 Subject: x86: fix typo in address space documentation Fix a trivial typo in Documentation/x86/x86_64/mm.txt. [ Impact: documentation only ] Signed-off-by: H. Peter Anvin Cc: Rik van Riel --- Documentation/x86/x86_64/mm.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt index 539413235845..d6498e3cd713 100644 --- a/Documentation/x86/x86_64/mm.txt +++ b/Documentation/x86/x86_64/mm.txt @@ -6,7 +6,7 @@ Virtual memory map with 4 level page tables: 0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm hole caused by [48:63] sign extension ffff800000000000 - ffff80ffffffffff (=40 bits) guard hole -ffff880000000000 - ffffc8ffffffffff (=64 TB) direct mapping of all phys. memory +ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole -- cgit v1.2.3 From 1dcdb5a9e7c235e6e80f1f4d5b8247b3e5347e48 Mon Sep 17 00:00:00 2001 From: Andi Kleen Date: Mon, 27 Apr 2009 17:44:11 +0200 Subject: oprofile: re-add force_arch_perfmon option This re-adds the force_arch_perfmon option that was in the original arch perfmon patchkit. Originally this was rejected in favour of a generalized perfmon=name option, but it turned out implementing the later in a reliable way is hard (and it would have been easy to crash the kernel if a user gets it wrong) But now Atom and Core i7 support being readded a user would need to update their oprofile userland to beyond 0.9.4 to use oprofile again on Atom or Core i7. To avoid this problem readd the force_arch_perfmon option. Signed-off-by: Andi Kleen Signed-off-by: Robert Richter --- Documentation/kernel-parameters.txt | 6 ++++++ arch/x86/oprofile/nmi_int.c | 6 ++++++ 2 files changed, 12 insertions(+) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 90b3924071b6..9b9566bf3301 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1650,6 +1650,12 @@ and is between 256 and 4096 characters. It is defined in the file oprofile.timer= [HW] Use timer interrupt instead of performance counters + oprofile.force_arch_perfmon=1 [X86] + Force use of architectural perfmon instead of + the CPU specific event set. + This might be useful if you have older oprofile + userland or if you want common events over Intel CPUs. + osst= [HW,SCSI] SCSI Tape Driver Format: , See also Documentation/scsi/st.txt. diff --git a/arch/x86/oprofile/nmi_int.c b/arch/x86/oprofile/nmi_int.c index 202864ad49a7..e5171c99e157 100644 --- a/arch/x86/oprofile/nmi_int.c +++ b/arch/x86/oprofile/nmi_int.c @@ -389,10 +389,16 @@ static int __init p4_init(char **cpu_type) return 0; } +int force_arch_perfmon; +module_param(force_arch_perfmon, int, 0); + static int __init ppro_init(char **cpu_type) { __u8 cpu_model = boot_cpu_data.x86_model; + if (force_arch_perfmon && cpu_has_arch_perfmon) + return 0; + switch (cpu_model) { case 0 ... 2: *cpu_type = "i386/ppro"; -- cgit v1.2.3 From 7e4e0bd50e80df2fe5501f48f872448376cdd997 Mon Sep 17 00:00:00 2001 From: Robert Richter Date: Wed, 6 May 2009 12:10:23 +0200 Subject: oprofile: introduce module_param oprofile.cpu_type This patch removes module_param oprofile.force_arch_perfmon and introduces oprofile.cpu_type=archperfmon instead. This new parameter can be reused for other models and architectures. Currently only archperfmon is supported. Cc: Andi Kleen Signed-off-by: Robert Richter --- Documentation/kernel-parameters.txt | 12 +++++++----- arch/x86/oprofile/nmi_int.c | 13 +++++++++++-- 2 files changed, 18 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 9b9566bf3301..6ce5f48859cc 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1650,11 +1650,13 @@ and is between 256 and 4096 characters. It is defined in the file oprofile.timer= [HW] Use timer interrupt instead of performance counters - oprofile.force_arch_perfmon=1 [X86] - Force use of architectural perfmon instead of - the CPU specific event set. - This might be useful if you have older oprofile - userland or if you want common events over Intel CPUs. + oprofile.cpu_type= Force an oprofile cpu type + This might be useful if you have an older oprofile + userland or if you want common events. + Format: { archperfmon } + archperfmon: [X86] Force use of architectural + perfmon on Intel CPUs instead of the + CPU specific event set. osst= [HW,SCSI] SCSI Tape Driver Format: , diff --git a/arch/x86/oprofile/nmi_int.c b/arch/x86/oprofile/nmi_int.c index 3308147182ae..3b285e656e27 100644 --- a/arch/x86/oprofile/nmi_int.c +++ b/arch/x86/oprofile/nmi_int.c @@ -386,8 +386,17 @@ static int __init p4_init(char **cpu_type) return 0; } -int force_arch_perfmon; -module_param(force_arch_perfmon, int, 0); +static int force_arch_perfmon; +static int force_cpu_type(const char *str, struct kernel_param *kp) +{ + if (!strcmp(str, "archperfmon")) { + force_arch_perfmon = 1; + printk(KERN_INFO "oprofile: forcing architectural perfmon\n"); + } + + return 0; +} +module_param_call(cpu_type, force_cpu_type, NULL, NULL, 0); static int __init ppro_init(char **cpu_type) { -- cgit v1.2.3 From b30505c81a9d4adea8b70ecff512b0216929b797 Mon Sep 17 00:00:00 2001 From: Darren Hart Date: Thu, 7 May 2009 15:40:14 -0700 Subject: futex: add requeue-pi documentation Add Documentation/futex-requeue-pi.txt describing the motivation for the newly added FUTEX_*REQUEUE_PI op codes and their implementation. [ Impact: add documentation ] Signed-off-by: Darren Hart Cc: Sripathi Kodi Cc: Peter Zijlstra Cc: John Stultz Cc: Steven Rostedt Cc: Dinakar Guniguntala Cc: Ulrich Drepper Cc: Eric Dumazet Cc: Jakub Jelinek LKML-Reference: <4A03634E.3080609@us.ibm.com> [ reformatted the file ] Signed-off-by: Ingo Molnar --- Documentation/futex-requeue-pi.txt | 131 +++++++++++++++++++++++++++++++++++++ 1 file changed, 131 insertions(+) create mode 100644 Documentation/futex-requeue-pi.txt (limited to 'Documentation') diff --git a/Documentation/futex-requeue-pi.txt b/Documentation/futex-requeue-pi.txt new file mode 100644 index 000000000000..9dc1ff4fd536 --- /dev/null +++ b/Documentation/futex-requeue-pi.txt @@ -0,0 +1,131 @@ +Futex Requeue PI +---------------- + +Requeueing of tasks from a non-PI futex to a PI futex requires +special handling in order to ensure the underlying rt_mutex is never +left without an owner if it has waiters; doing so would break the PI +boosting logic [see rt-mutex-desgin.txt] For the purposes of +brevity, this action will be referred to as "requeue_pi" throughout +this document. Priority inheritance is abbreviated throughout as +"PI". + +Motivation +---------- + +Without requeue_pi, the glibc implementation of +pthread_cond_broadcast() must resort to waking all the tasks waiting +on a pthread_condvar and letting them try to sort out which task +gets to run first in classic thundering-herd formation. An ideal +implementation would wake the highest-priority waiter, and leave the +rest to the natural wakeup inherent in unlocking the mutex +associated with the condvar. + +Consider the simplified glibc calls: + +/* caller must lock mutex */ +pthread_cond_wait(cond, mutex) +{ + lock(cond->__data.__lock); + unlock(mutex); + do { + unlock(cond->__data.__lock); + futex_wait(cond->__data.__futex); + lock(cond->__data.__lock); + } while(...) + unlock(cond->__data.__lock); + lock(mutex); +} + +pthread_cond_broadcast(cond) +{ + lock(cond->__data.__lock); + unlock(cond->__data.__lock); + futex_requeue(cond->data.__futex, cond->mutex); +} + +Once pthread_cond_broadcast() requeues the tasks, the cond->mutex +has waiters. Note that pthread_cond_wait() attempts to lock the +mutex only after it has returned to user space. This will leave the +underlying rt_mutex with waiters, and no owner, breaking the +previously mentioned PI-boosting algorithms. + +In order to support PI-aware pthread_condvar's, the kernel needs to +be able to requeue tasks to PI futexes. This support implies that +upon a successful futex_wait system call, the caller would return to +user space already holding the PI futex. The glibc implementation +would be modified as follows: + + +/* caller must lock mutex */ +pthread_cond_wait_pi(cond, mutex) +{ + lock(cond->__data.__lock); + unlock(mutex); + do { + unlock(cond->__data.__lock); + futex_wait_requeue_pi(cond->__data.__futex); + lock(cond->__data.__lock); + } while(...) + unlock(cond->__data.__lock); + /* the kernel acquired the the mutex for us */ +} + +pthread_cond_broadcast_pi(cond) +{ + lock(cond->__data.__lock); + unlock(cond->__data.__lock); + futex_requeue_pi(cond->data.__futex, cond->mutex); +} + +The actual glibc implementation will likely test for PI and make the +necessary changes inside the existing calls rather than creating new +calls for the PI cases. Similar changes are needed for +pthread_cond_timedwait() and pthread_cond_signal(). + +Implementation +-------------- + +In order to ensure the rt_mutex has an owner if it has waiters, it +is necessary for both the requeue code, as well as the waiting code, +to be able to acquire the rt_mutex before returning to user space. +The requeue code cannot simply wake the waiter and leave it to +acquire the rt_mutex as it would open a race window between the +requeue call returning to user space and the waiter waking and +starting to run. This is especially true in the uncontended case. + +The solution involves two new rt_mutex helper routines, +rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which +allow the requeue code to acquire an uncontended rt_mutex on behalf +of the waiter and to enqueue the waiter on a contended rt_mutex. +Two new system calls provide the kernel<->user interface to +requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_REQUEUE_CMP_PI. + +FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait() +and pthread_cond_timedwait()) to block on the initial futex and wait +to be requeued to a PI-aware futex. The implementation is the +result of a high-speed collision between futex_wait() and +futex_lock_pi(), with some extra logic to check for the additional +wake-up scenarios. + +FUTEX_REQUEUE_CMP_PI is called by the waker +(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and +possibly wake the waiting tasks. Internally, this system call is +still handled by futex_requeue (by passing requeue_pi=1). Before +requeueing, futex_requeue() attempts to acquire the requeue target +PI futex on behalf of the top waiter. If it can, this waiter is +woken. futex_requeue() then proceeds to requeue the remaining +nr_wake+nr_requeue tasks to the PI futex, calling +rt_mutex_start_proxy_lock() prior to each requeue to prepare the +task as a waiter on the underlying rt_mutex. It is possible that +the lock can be acquired at this stage as well, if so, the next +waiter is woken to finish the acquisition of the lock. + +FUTEX_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but +their sum is all that really matters. futex_requeue() will wake or +requeue up to nr_wake + nr_requeue tasks. It will wake only as many +tasks as it can acquire the lock for, which in the majority of cases +should be 0 as good programming practice dictates that the caller of +either pthread_cond_broadcast() or pthread_cond_signal() acquire the +mutex prior to making the call. FUTEX_REQUEUE_PI requires that +nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for +signal. -- cgit v1.2.3 From d297366ba692faf1f0384811a6ff0b20c3470b1b Mon Sep 17 00:00:00 2001 From: "H. Peter Anvin" Date: Mon, 11 May 2009 16:06:23 -0700 Subject: x86: document new bzImage fields Document the new bzImage fields for kernel memory placement. [ Impact: adds documentation ] Signed-off-by: H. Peter Anvin --- Documentation/x86/boot.txt | 69 +++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 65 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/x86/boot.txt b/Documentation/x86/boot.txt index e0203662f9e9..cf8dfc70a118 100644 --- a/Documentation/x86/boot.txt +++ b/Documentation/x86/boot.txt @@ -50,6 +50,11 @@ Protocol 2.08: (Kernel 2.6.26) Added crc32 checksum and ELF format Protocol 2.09: (Kernel 2.6.26) Added a field of 64-bit physical pointer to single linked list of struct setup_data. +Protocol 2.10: (Kernel 2.6.31) A protocol for relaxed alignment + beyond the kernel_alignment added, new init_size and + pref_address fields. + + **** MEMORY LAYOUT The traditional memory map for the kernel loader, used for Image or @@ -173,7 +178,7 @@ Offset Proto Name Meaning 022C/4 2.03+ ramdisk_max Highest legal initrd address 0230/4 2.05+ kernel_alignment Physical addr alignment required for kernel 0234/1 2.05+ relocatable_kernel Whether kernel is relocatable or not -0235/1 N/A pad2 Unused +0235/1 2.10+ min_alignment Minimum alignment, as a power of two 0236/2 N/A pad3 Unused 0238/4 2.06+ cmdline_size Maximum size of the kernel command line 023C/4 2.07+ hardware_subarch Hardware subarchitecture @@ -182,6 +187,8 @@ Offset Proto Name Meaning 024C/4 2.08+ payload_length Length of kernel payload 0250/8 2.09+ setup_data 64-bit physical pointer to linked list of struct setup_data +0258/8 2.10+ pref_address Preferred loading address +0260/4 2.10+ init_size Linear memory required during initialization (1) For backwards compatibility, if the setup_sects field contains 0, the real value is 4. @@ -482,11 +489,19 @@ Protocol: 2.03+ 0x37FFFFFF, you can start your ramdisk at 0x37FE0000.) Field name: kernel_alignment -Type: read (reloc) +Type: read/modify (reloc) Offset/size: 0x230/4 -Protocol: 2.05+ +Protocol: 2.05+ (read), 2.10+ (modify) + + Alignment unit required by the kernel (if relocatable_kernel is + true.) A relocatable kernel that is loaded at an alignment + incompatible with the value in this field will be realigned during + kernel initialization. - Alignment unit required by the kernel (if relocatable_kernel is true.) + Starting with protocol version 2.10, this reflects the kernel + alignment preferred for optimal performance; it is possible for the + loader to modify this field to permit a lesser alignment. See the + min_alignment and pref_address field below. Field name: relocatable_kernel Type: read (reloc) @@ -498,6 +513,22 @@ Protocol: 2.05+ After loading, the boot loader must set the code32_start field to point to the loaded code, or to a boot loader hook. +Field name: min_alignment +Type: read (reloc) +Offset/size: 0x235/1 +Protocol: 2.10+ + + This field, if nonzero, indicates as a power of two the minimum + alignment required, as opposed to preferred, by the kernel to boot. + If a boot loader makes use of this field, it should update the + kernel_alignment field with the alignment unit desired; typically: + + kernel_alignment = 1 << min_alignment + + There may be a considerable performance cost with an excessively + misaligned kernel. Therefore, a loader should typically try each + power-of-two alignment from kernel_alignment down to this alignment. + Field name: cmdline_size Type: read Offset/size: 0x238/4 @@ -582,6 +613,36 @@ Protocol: 2.09+ sure to consider the case where the linked list already contains entries. +Field name: pref_address +Type: read (reloc) +Offset/size: 0x258/8 +Protocol: 2.10+ + + This field, if nonzero, represents a preferred load address for the + kernel. A relocating bootloader should attempt to load at this + address if possible. + + A non-relocatable kernel will unconditionally move itself and to run + at this address. + +Field name: init_size +Type: read +Offset/size: 0x25c/4 + + This field indicates the amount of linear contiguous memory starting + at the kernel runtime start address that the kernel needs before it + is capable of examining its memory map. This is not the same thing + as the total amount of memory the kernel needs to boot, but it can + be used by a relocating boot loader to help select a safe load + address for the kernel. + + The kernel runtime start address is determined by the following algorithm: + + if (relocatable_kernel) + runtime_start = align_up(load_address, kernel_alignment) + else + runtime_start = pref_address + **** THE IMAGE CHECKSUM -- cgit v1.2.3 From 5031296c57024a78ddad4edfc993367dbf4abb98 Mon Sep 17 00:00:00 2001 From: "H. Peter Anvin" Date: Thu, 7 May 2009 16:54:11 -0700 Subject: x86: add extension fields for bootloader type and version A long ago, in days of yore, it all began with a god named Thor. There were vikings and boats and some plans for a Linux kernel header. Unfortunately, a single 8-bit field was used for bootloader type and version. This has generally worked without *too* much pain, but we're getting close to flat running out of ID fields. Add extension fields for both type and version. The type will be extended if it the old field is 0xE; the version is a simple MSB extension. Keep /proc/sys/kernel/bootloader_type containing (type << 4) + (ver & 0xf) for backwards compatiblity, but also add /proc/sys/kernel/bootloader_version which contains the full version number. [ Impact: new feature to support more bootloaders ] Signed-off-by: H. Peter Anvin --- Documentation/x86/boot.txt | 59 +++++++++++++++++++++++++++++++++++----- arch/x86/boot/header.S | 6 +++- arch/x86/include/asm/bootparam.h | 3 +- arch/x86/include/asm/processor.h | 1 + arch/x86/kernel/setup.c | 10 +++++-- kernel/sysctl.c | 8 ++++++ 6 files changed, 76 insertions(+), 11 deletions(-) (limited to 'Documentation') diff --git a/Documentation/x86/boot.txt b/Documentation/x86/boot.txt index cf8dfc70a118..8da3a795083f 100644 --- a/Documentation/x86/boot.txt +++ b/Documentation/x86/boot.txt @@ -50,10 +50,9 @@ Protocol 2.08: (Kernel 2.6.26) Added crc32 checksum and ELF format Protocol 2.09: (Kernel 2.6.26) Added a field of 64-bit physical pointer to single linked list of struct setup_data. -Protocol 2.10: (Kernel 2.6.31) A protocol for relaxed alignment +Protocol 2.10: (Kernel 2.6.31) Added a protocol for relaxed alignment beyond the kernel_alignment added, new init_size and - pref_address fields. - + pref_address fields. Added extended boot loader IDs. **** MEMORY LAYOUT @@ -173,7 +172,8 @@ Offset Proto Name Meaning 021C/4 2.00+ ramdisk_size initrd size (set by boot loader) 0220/4 2.00+ bootsect_kludge DO NOT USE - for bootsect.S use only 0224/2 2.01+ heap_end_ptr Free memory after setup end -0226/2 N/A pad1 Unused +0226/1 2.02+(3 ext_loader_ver Extended boot loader version +0227/1 2.02+(3 ext_loader_type Extended boot loader ID 0228/4 2.02+ cmd_line_ptr 32-bit pointer to the kernel command line 022C/4 2.03+ ramdisk_max Highest legal initrd address 0230/4 2.05+ kernel_alignment Physical addr alignment required for kernel @@ -197,6 +197,8 @@ Offset Proto Name Meaning field are unusable, which means the size of a bzImage kernel cannot be determined. +(3) Ignored, but safe to set, for boot protocols 2.02-2.09. + If the "HdrS" (0x53726448) magic number is not found at offset 0x202, the boot protocol version is "old". Loading an old kernel, the following parameters should be assumed: @@ -350,18 +352,32 @@ Protocol: 2.00+ 0xTV here, where T is an identifier for the boot loader and V is a version number. Otherwise, enter 0xFF here. + For boot loader IDs above T = 0xD, write T = 0xE to this field and + write the extended ID minus 0x10 to the ext_loader_type field. + Similarly, the ext_loader_ver field can be used to provide more than + four bits for the bootloader version. + + For example, for T = 0x15, V = 0x234, write: + + type_of_loader <- 0xE4 + ext_loader_type <- 0x05 + ext_loader_ver <- 0x23 + Assigned boot loader ids: 0 LILO (0x00 reserved for pre-2.00 bootloader) 1 Loadlin 2 bootsect-loader (0x20, all other values reserved) - 3 SYSLINUX - 4 EtherBoot + 3 Syslinux + 4 Etherboot/gPXE 5 ELILO 7 GRUB - 8 U-BOOT + 8 U-Boot 9 Xen A Gujin B Qemu + C Arcturus Networks uCbootloader + E Extended (see ext_loader_type) + F Special (0xFF = undefined) Please contact if you need a bootloader ID value assigned. @@ -460,6 +476,35 @@ Protocol: 2.01+ Set this field to the offset (from the beginning of the real-mode code) of the end of the setup stack/heap, minus 0x0200. +Field name: ext_loader_ver +Type: write (optional) +Offset/size: 0x226/1 +Protocol: 2.02+ + + This field is used as an extension of the version number in the + type_of_loader field. The total version number is considered to be + (type_of_loader & 0x0f) + (ext_loader_ver << 4). + + The use of this field is boot loader specific. If not written, it + is zero. + + Kernels prior to 2.6.31 did not recognize this field, but it is safe + to write for protocol version 2.02 or higher. + +Field name: ext_loader_type +Type: write (obligatory if (type_of_loader & 0xf0) == 0xe0) +Offset/size: 0x227/1 +Protocol: 2.02+ + + This field is used as an extension of the type number in + type_of_loader field. If the type in type_of_loader is 0xE, then + the actual type is (ext_loader_type + 0x10). + + This field is ignored if the type in type_of_loader is not 0xE. + + Kernels prior to 2.6.31 did not recognize this field, but it is safe + to write for protocol version 2.02 or higher. + Field name: cmd_line_ptr Type: write (obligatory) Offset/size: 0x228/4 diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S index a0b426978d55..68c3bfbaff24 100644 --- a/arch/x86/boot/header.S +++ b/arch/x86/boot/header.S @@ -169,7 +169,11 @@ heap_end_ptr: .word _end+STACK_SIZE-512 # end of setup code can be used by setup # for local heap purposes. -pad1: .word 0 +ext_loader_ver: + .byte 0 # Extended boot loader version +ext_loader_type: + .byte 0 # Extended boot loader type + cmd_line_ptr: .long 0 # (Header version 0x0202 or later) # If nonzero, a 32-bit pointer # to the kernel command line. diff --git a/arch/x86/include/asm/bootparam.h b/arch/x86/include/asm/bootparam.h index 433adaebf9b6..1724e8de317c 100644 --- a/arch/x86/include/asm/bootparam.h +++ b/arch/x86/include/asm/bootparam.h @@ -50,7 +50,8 @@ struct setup_header { __u32 ramdisk_size; __u32 bootsect_kludge; __u16 heap_end_ptr; - __u16 _pad1; + __u8 ext_loader_ver; + __u8 ext_loader_type; __u32 cmd_line_ptr; __u32 initrd_addr_max; __u32 kernel_alignment; diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index fcf4d92e7e04..6384d25121ca 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -814,6 +814,7 @@ extern unsigned int BIOS_revision; /* Boot loader type from the setup header: */ extern int bootloader_type; +extern int bootloader_version; extern char ignore_fpu_irq; diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index b4158439bf63..2b093451aec9 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -214,8 +214,8 @@ unsigned long mmu_cr4_features; unsigned long mmu_cr4_features = X86_CR4_PAE; #endif -/* Boot loader ID as an integer, for the benefit of proc_dointvec */ -int bootloader_type; +/* Boot loader ID and version as integers, for the benefit of proc_dointvec */ +int bootloader_type, bootloader_version; /* * Setup options @@ -706,6 +706,12 @@ void __init setup_arch(char **cmdline_p) #endif saved_video_mode = boot_params.hdr.vid_mode; bootloader_type = boot_params.hdr.type_of_loader; + if ((bootloader_type >> 4) == 0xe) { + bootloader_type &= 0xf; + bootloader_type |= (boot_params.hdr.ext_loader_type+0x10) << 4; + } + bootloader_version = bootloader_type & 0xf; + bootloader_version |= boot_params.hdr.ext_loader_ver << 4; #ifdef CONFIG_BLK_DEV_RAM rd_image_start = boot_params.hdr.ram_size & RAMDISK_IMAGE_START_MASK; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index e3d2c7dd59b9..cf91c9317b26 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -727,6 +727,14 @@ static struct ctl_table kern_table[] = { .mode = 0444, .proc_handler = &proc_dointvec, }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "bootloader_version", + .data = &bootloader_version, + .maxlen = sizeof (int), + .mode = 0444, + .proc_handler = &proc_dointvec, + }, { .ctl_name = CTL_UNNUMBERED, .procname = "kstack_depth_to_print", -- cgit v1.2.3 From 888a589f6be07d624e21e2174d98375e9f95911b Mon Sep 17 00:00:00 2001 From: Yinghai Lu Date: Fri, 15 May 2009 13:59:37 -0700 Subject: mm, x86: remove MEMORY_HOTPLUG_RESERVE related code after: | commit b263295dbffd33b0fbff670720fa178c30e3392a | Author: Christoph Lameter | Date: Wed Jan 30 13:30:47 2008 +0100 | | x86: 64-bit, make sparsemem vmemmap the only memory model we don't have MEMORY_HOTPLUG_RESERVE anymore. Historically, x86-64 had an architecture-specific method for memory hotplug whereby it scanned the SRAT for physical memory ranges that could be potentially used for memory hot-add later. By reserving those ranges without physical memory, the memmap would be allocated and left dormant until needed. This depended on the DISCONTIG memory model which has been removed so the code implementing HOTPLUG_RESERVE is now dead. This patch removes the dead code used by MEMORY_HOTPLUG_RESERVE. (Changelog authored by Mel.) v2: updated changelog, and remove hotadd= in doc [ Impact: remove dead code ] Signed-off-by: Yinghai Lu Reviewed-by: Christoph Lameter Reviewed-by: Mel Gorman Workflow-found-OK-by: Andrew Morton LKML-Reference: <4A0C4910.7090508@kernel.org> Signed-off-by: Ingo Molnar --- Documentation/x86/x86_64/boot-options.txt | 5 --- arch/x86/include/asm/numa_64.h | 3 -- arch/x86/mm/numa_64.c | 5 --- arch/x86/mm/srat_64.c | 63 ++++++---------------------- include/linux/mm.h | 2 - mm/page_alloc.c | 69 ------------------------------- 6 files changed, 12 insertions(+), 135 deletions(-) (limited to 'Documentation') diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt index 34c13040a718..2db5893d6c97 100644 --- a/Documentation/x86/x86_64/boot-options.txt +++ b/Documentation/x86/x86_64/boot-options.txt @@ -150,11 +150,6 @@ NUMA Otherwise, the remaining system RAM is allocated to an additional node. - numa=hotadd=percent - Only allow hotadd memory to preallocate page structures upto - percent of already available memory. - numa=hotadd=0 will disable hotadd memory. - ACPI acpi=off Don't enable ACPI diff --git a/arch/x86/include/asm/numa_64.h b/arch/x86/include/asm/numa_64.h index 064ed6df4cbe..7feff0648d74 100644 --- a/arch/x86/include/asm/numa_64.h +++ b/arch/x86/include/asm/numa_64.h @@ -17,9 +17,6 @@ extern int compute_hash_shift(struct bootnode *nodes, int numblks, extern void numa_init_array(void); extern int numa_off; -extern void srat_reserve_add_area(int nodeid); -extern int hotadd_percent; - extern s16 apicid_to_node[MAX_LOCAL_APIC]; extern unsigned long numa_free_all_bootmem(void); diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c index fb61d81a656f..a6a93c395231 100644 --- a/arch/x86/mm/numa_64.c +++ b/arch/x86/mm/numa_64.c @@ -272,9 +272,6 @@ void __init setup_node_bootmem(int nodeid, unsigned long start, reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start, bootmap_pages<> PAGE_SHIFT; unsigned long e_pfn = end >> PAGE_SHIFT; - int ret = 0, changed = 0; + int changed = 0; struct bootnode *nd = &nodes_add[node]; /* I had some trouble with strange memory hotadd regions breaking @@ -210,7 +201,7 @@ reserve_hotadd(int node, unsigned long start, unsigned long end) mistakes */ if ((signed long)(end - start) < NODE_MIN_SIZE) { printk(KERN_ERR "SRAT: Hotplug area too small\n"); - return -1; + return; } /* This check might be a bit too strict, but I'm keeping it for now. */ @@ -218,12 +209,7 @@ reserve_hotadd(int node, unsigned long start, unsigned long end) printk(KERN_ERR "SRAT: Hotplug area %lu -> %lu has existing memory\n", s_pfn, e_pfn); - return -1; - } - - if (!hotadd_enough_memory(&nodes_add[node])) { - printk(KERN_ERR "SRAT: Hotplug area too large\n"); - return -1; + return; } /* Looks good */ @@ -245,11 +231,9 @@ reserve_hotadd(int node, unsigned long start, unsigned long end) printk(KERN_ERR "SRAT: Hotplug zone not continuous. Partly ignored\n"); } - ret = update_end_of_memory(nd->end); - if (changed) - printk(KERN_INFO "SRAT: hot plug zone found %Lx - %Lx\n", nd->start, nd->end); - return ret; + printk(KERN_INFO "SRAT: hot plug zone found %Lx - %Lx\n", + nd->start, nd->end); } /* Callback for parsing of the Proximity Domain <-> Memory Area mappings */ @@ -310,13 +294,10 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma) start, end); e820_register_active_regions(node, start >> PAGE_SHIFT, end >> PAGE_SHIFT); - push_node_boundaries(node, nd->start >> PAGE_SHIFT, - nd->end >> PAGE_SHIFT); - if ((ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) && - (reserve_hotadd(node, start, end) < 0)) { - /* Ignore hotadd region. Undo damage */ - printk(KERN_NOTICE "SRAT: Hotplug region ignored\n"); + if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) { + update_nodes_add(node, start, end); + /* restore nodes[node] */ *nd = oldnode; if ((nd->start | nd->end) == 0) node_clear(node, nodes_parsed); @@ -510,26 +491,6 @@ static int null_slit_node_compare(int a, int b) } #endif /* CONFIG_NUMA_EMU */ -void __init srat_reserve_add_area(int nodeid) -{ - if (found_add_area && nodes_add[nodeid].end) { - u64 total_mb; - - printk(KERN_INFO "SRAT: Reserving hot-add memory space " - "for node %d at %Lx-%Lx\n", - nodeid, nodes_add[nodeid].start, nodes_add[nodeid].end); - total_mb = (nodes_add[nodeid].end - nodes_add[nodeid].start) - >> PAGE_SHIFT; - total_mb *= sizeof(struct page); - total_mb >>= 20; - printk(KERN_INFO "SRAT: This will cost you %Lu MB of " - "pre-allocated memory.\n", (unsigned long long)total_mb); - reserve_bootmem_node(NODE_DATA(nodeid), nodes_add[nodeid].start, - nodes_add[nodeid].end - nodes_add[nodeid].start, - BOOTMEM_DEFAULT); - } -} - int __node_distance(int a, int b) { int index; diff --git a/include/linux/mm.h b/include/linux/mm.h index bff1f0d475c7..511b09867096 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1031,8 +1031,6 @@ extern void add_active_range(unsigned int nid, unsigned long start_pfn, unsigned long end_pfn); extern void remove_active_range(unsigned int nid, unsigned long start_pfn, unsigned long end_pfn); -extern void push_node_boundaries(unsigned int nid, unsigned long start_pfn, - unsigned long end_pfn); extern void remove_all_active_ranges(void); extern unsigned long absent_pages_in_range(unsigned long start_pfn, unsigned long end_pfn); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fe753ecf2aa5..474c7e9dd51a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -149,10 +149,6 @@ static unsigned long __meminitdata dma_reserve; static int __meminitdata nr_nodemap_entries; static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES]; static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES]; -#ifdef CONFIG_MEMORY_HOTPLUG_RESERVE - static unsigned long __meminitdata node_boundary_start_pfn[MAX_NUMNODES]; - static unsigned long __meminitdata node_boundary_end_pfn[MAX_NUMNODES]; -#endif /* CONFIG_MEMORY_HOTPLUG_RESERVE */ static unsigned long __initdata required_kernelcore; static unsigned long __initdata required_movablecore; static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES]; @@ -3102,64 +3098,6 @@ void __init sparse_memory_present_with_active_regions(int nid) early_node_map[i].end_pfn); } -/** - * push_node_boundaries - Push node boundaries to at least the requested boundary - * @nid: The nid of the node to push the boundary for - * @start_pfn: The start pfn of the node - * @end_pfn: The end pfn of the node - * - * In reserve-based hot-add, mem_map is allocated that is unused until hotadd - * time. Specifically, on x86_64, SRAT will report ranges that can potentially - * be hotplugged even though no physical memory exists. This function allows - * an arch to push out the node boundaries so mem_map is allocated that can - * be used later. - */ -#ifdef CONFIG_MEMORY_HOTPLUG_RESERVE -void __init push_node_boundaries(unsigned int nid, - unsigned long start_pfn, unsigned long end_pfn) -{ - mminit_dprintk(MMINIT_TRACE, "zoneboundary", - "Entering push_node_boundaries(%u, %lu, %lu)\n", - nid, start_pfn, end_pfn); - - /* Initialise the boundary for this node if necessary */ - if (node_boundary_end_pfn[nid] == 0) - node_boundary_start_pfn[nid] = -1UL; - - /* Update the boundaries */ - if (node_boundary_start_pfn[nid] > start_pfn) - node_boundary_start_pfn[nid] = start_pfn; - if (node_boundary_end_pfn[nid] < end_pfn) - node_boundary_end_pfn[nid] = end_pfn; -} - -/* If necessary, push the node boundary out for reserve hotadd */ -static void __meminit account_node_boundary(unsigned int nid, - unsigned long *start_pfn, unsigned long *end_pfn) -{ - mminit_dprintk(MMINIT_TRACE, "zoneboundary", - "Entering account_node_boundary(%u, %lu, %lu)\n", - nid, *start_pfn, *end_pfn); - - /* Return if boundary information has not been provided */ - if (node_boundary_end_pfn[nid] == 0) - return; - - /* Check the boundaries and update if necessary */ - if (node_boundary_start_pfn[nid] < *start_pfn) - *start_pfn = node_boundary_start_pfn[nid]; - if (node_boundary_end_pfn[nid] > *end_pfn) - *end_pfn = node_boundary_end_pfn[nid]; -} -#else -void __init push_node_boundaries(unsigned int nid, - unsigned long start_pfn, unsigned long end_pfn) {} - -static void __meminit account_node_boundary(unsigned int nid, - unsigned long *start_pfn, unsigned long *end_pfn) {} -#endif - - /** * get_pfn_range_for_nid - Return the start and end page frames for a node * @nid: The nid to return the range for. If MAX_NUMNODES, the min and max PFN are returned. @@ -3185,9 +3123,6 @@ void __meminit get_pfn_range_for_nid(unsigned int nid, if (*start_pfn == -1UL) *start_pfn = 0; - - /* Push the node boundaries out if requested */ - account_node_boundary(nid, start_pfn, end_pfn); } /* @@ -3793,10 +3728,6 @@ void __init remove_all_active_ranges(void) { memset(early_node_map, 0, sizeof(early_node_map)); nr_nodemap_entries = 0; -#ifdef CONFIG_MEMORY_HOTPLUG_RESERVE - memset(node_boundary_start_pfn, 0, sizeof(node_boundary_start_pfn)); - memset(node_boundary_end_pfn, 0, sizeof(node_boundary_end_pfn)); -#endif /* CONFIG_MEMORY_HOTPLUG_RESERVE */ } /* Compare two active node_active_regions */ -- cgit v1.2.3 From 143c145e3a475065a4be661468d0df1bd0b25f74 Mon Sep 17 00:00:00 2001 From: Li Zefan Date: Tue, 19 May 2009 14:43:15 +0800 Subject: tracing/events: Documentation updates - fix some typos - document the difference between '>' and '>>' - document the 'enable' toggle - remove section "Defining an event-enabled tracepoint", since it's out-dated and sample/trace_events/ already serves this purpose. v2: add "Updated by Li Zefan" [ Impact: make documentation up-to-date ] Signed-off-by: Li Zefan Cc: Steven Rostedt Cc: Frederic Weisbecker Cc: "Theodore Ts'o" LKML-Reference: <4A125503.5060406@cn.fujitsu.com> Signed-off-by: Ingo Molnar --- Documentation/trace/events.txt | 159 +++++++++++++++-------------------------- 1 file changed, 57 insertions(+), 102 deletions(-) (limited to 'Documentation') diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt index abdee664c0f6..f157d7594ea7 100644 --- a/Documentation/trace/events.txt +++ b/Documentation/trace/events.txt @@ -1,9 +1,10 @@ Event Tracing Documentation written by Theodore Ts'o + Updated by Li Zefan -Introduction -============ +1. Introduction +=============== Tracepoints (see Documentation/trace/tracepoints.txt) can be used without creating custom kernel modules to register probe functions @@ -12,30 +13,37 @@ using the event tracing infrastructure. Not all tracepoints can be traced using the event tracing system; the kernel developer must provide code snippets which define how the tracing information is saved into the tracing buffer, and how the -the tracing information should be printed. +tracing information should be printed. -Using Event Tracing -=================== +2. Using Event Tracing +====================== + +2.1 Via the 'set_event' interface +--------------------------------- The events which are available for tracing can be found in the file -/sys/kernel/debug/tracing/available_events. +/debug/tracing/available_events. To enable a particular event, such as 'sched_wakeup', simply echo it -to /sys/debug/tracing/set_event. For example: +to /debug/tracing/set_event. For example: - # echo sched_wakeup > /sys/kernel/debug/tracing/set_event + # echo sched_wakeup >> /debug/tracing/set_event -[ Note: events can also be enabled/disabled via the 'enabled' toggle - found in the /sys/kernel/tracing/events/ hierarchy of directories. ] +[ Note: '>>' is necessary, otherwise it will firstly disable + all the events. ] To disable an event, echo the event name to the set_event file prefixed with an exclamation point: - # echo '!sched_wakeup' >> /sys/kernel/debug/tracing/set_event + # echo '!sched_wakeup' >> /debug/tracing/set_event + +To disable all events, echo an empty line to the set_event file: + + # echo > /debug/tracing/set_event -To disable events, echo an empty line to the set_event file: +To enable all events, echo '*:*' or '*:' to the set_event file: - # echo > /sys/kernel/debug/tracing/set_event + # echo *:* > /debug/tracing/set_event The events are organized into subsystems, such as ext4, irq, sched, etc., and a full event name looks like this: :. The @@ -44,92 +52,39 @@ file. All of the events in a subsystem can be specified via the syntax ":*"; for example, to enable all irq events, you can use the command: - # echo 'irq:*' > /sys/kernel/debug/tracing/set_event - -Defining an event-enabled tracepoint ------------------------------------- - -A kernel developer which wishes to define an event-enabled tracepoint -must declare the tracepoint using TRACE_EVENT instead of DECLARE_TRACE. -This is done via two header files in include/trace. For example, to -event-enable the jbd2 subsystem, we must create two files, -include/trace/jbd2.h and include/trace/jbd2_event_types.h. The -include/trace/jbd2.h file should be included by kernel source files that -will have a tracepoint inserted, and might look like this: - -#ifndef _TRACE_JBD2_H -#define _TRACE_JBD2_H - -#include -#include - -#include - -#endif - -In a file that utilizes a jbd2 tracepoint, this header file would be -included. Note that you still have to use DEFINE_TRACE(). So for -example, if fs/jbd2/commit.c planned to use the jbd2_start_commit -tracepoint, it would have the following near the beginning of the file: - -#include - -DEFINE_TRACE(jbd2_start_commit); - -Then in the function that would call the tracepoint, it would call the -tracepoint function. (For more information, please see the tracepoint -documentation in Documentation/trace/tracepoints.txt): - - trace_jbd2_start_commit(journal, commit_transaction); - -The code snippets which allow jbd2_start_commit to be an event-enabled -tracepoint are placed in the file include/trace/jbd2_event_types.h: - -/* use instead */ -#ifndef TRACE_EVENT -# error Do not include this file directly. -# error Unless you know what you are doing. -#endif - -#undef TRACE_SYSTEM -#define TRACE_SYSTEM jbd2 - -#include - -TRACE_EVENT(jbd2_start_commit, - TP_PROTO(journal_t *journal, transaction_t *commit_transaction), - TP_ARGS(journal, commit_transaction), - TP_STRUCT__entry( - __array( char, devname, BDEVNAME_SIZE+24 ) - __field( int, transaction ) - ), - TP_fast_assign( - memcpy(__entry->devname, journal->j_devname, BDEVNAME_SIZE+24); - __entry->transaction = commit_transaction->t_tid; - ), - TP_printk("dev %s transaction %d", - __entry->devname, __entry->transaction) -); - -The TP_PROTO and TP_ARGS are unchanged from DECLARE_TRACE. The new -arguments to TRACE_EVENT are TP_STRUCT__entry, TP_fast_assign, and -TP_printk. - -TP_STRUCT__entry defines the data structure which will be stored in the -trace buffer. Normally, fields in __entry will be arrays or simple -types. It is possible to place data structures in __entry --- however, -pointers in the data structure can not be trusted, since they will be -accessed sometime later by TP_printk, and if the data structure contains -fields that will not or cannot be used by TP_printk, this will waste -space in the trace buffer. In general, data structures should be -avoided, unless they do only contain non-pointer types and all of the -fields will be used by TP_printk. - -TP_fast_assign defines the code snippet which saves information into the -__entry data structure, using the passed-in arguments defined in -TP_PROTO and TP_ARGS. - -Finally, TP_printk will print the __entry data structure. At the time -when the code snippet defined by TP_printk is executed, it will not have -access to the TP_ARGS arguments; it can only use the information saved -in the __entry data structure. + # echo 'irq:*' > /debug/tracing/set_event + +2.2 Via the 'enable' toggle +--------------------------- + +The events available are also listed in /debug/tracing/events/ hierarchy +of directories. + +To enable event 'sched_wakeup': + + # echo 1 > /debug/tracing/events/sched/sched_wakeup/enable + +To disable it: + + # echo 0 > /debug/tracing/events/sched/sched_wakeup/enable + +To enable all events in sched subsystem: + + # echo 1 > /debug/tracing/events/sched/enable + +To eanble all events: + + # echo 1 > /debug/tracing/events/enable + +When reading one of these enable files, there are four results: + + 0 - all events this file affects are disabled + 1 - all events this file affects are enabled + X - there is a mixture of events enabled and disabled + ? - this file does not affect any event + +3. Defining an event-enabled tracepoint +======================================= + +See The example provided in samples/trace_events + -- cgit v1.2.3 From e9ccb73ab57ada469602506496c42e5b4468ac3e Mon Sep 17 00:00:00 2001 From: Steven Whitehouse Date: Tue, 19 May 2009 10:23:23 +0100 Subject: GFS2: Update docs Update a few things which were out of date, and fix a typo. Signed-off-by: Steven Whitehouse --- Documentation/filesystems/gfs2-glocks.txt | 2 +- Documentation/filesystems/gfs2.txt | 19 +++++++++++-------- 2 files changed, 12 insertions(+), 9 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/gfs2-glocks.txt b/Documentation/filesystems/gfs2-glocks.txt index 4dae9a3840bf..0494f78d87e4 100644 --- a/Documentation/filesystems/gfs2-glocks.txt +++ b/Documentation/filesystems/gfs2-glocks.txt @@ -60,7 +60,7 @@ go_lock | Called for the first local holder of a lock go_unlock | Called on the final local unlock of a lock go_dump | Called to print content of object for debugfs file, or on | error to dump glock to the log. -go_type; | The type of the glock, LM_TYPE_..... +go_type | The type of the glock, LM_TYPE_..... go_min_hold_time | The minimum hold time The minimum hold time for each lock is the time after a remote lock diff --git a/Documentation/filesystems/gfs2.txt b/Documentation/filesystems/gfs2.txt index 593004b6bbab..5e3ab8f3beff 100644 --- a/Documentation/filesystems/gfs2.txt +++ b/Documentation/filesystems/gfs2.txt @@ -11,18 +11,15 @@ their I/O so file system consistency is maintained. One of the nifty features of GFS is perfect consistency -- changes made to the file system on one machine show up immediately on all other machines in the cluster. -GFS uses interchangable inter-node locking mechanisms. Different lock -modules can plug into GFS and each file system selects the appropriate -lock module at mount time. Lock modules include: +GFS uses interchangable inter-node locking mechanisms, the currently +supported mechanisms are: lock_nolock -- allows gfs to be used as a local file system lock_dlm -- uses a distributed lock manager (dlm) for inter-node locking The dlm is found at linux/fs/dlm/ -In addition to interfacing with an external locking manager, a gfs lock -module is responsible for interacting with external cluster management -systems. Lock_dlm depends on user space cluster management systems found +Lock_dlm depends on user space cluster management systems found at the URL above. To use gfs as a local file system, no external clustering systems are @@ -31,13 +28,19 @@ needed, simply: $ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device $ mount -t gfs2 /dev/block_device /dir -GFS2 is not on-disk compatible with previous versions of GFS. +If you are using Fedora, you need to install the gfs2-utils package +and, for lock_dlm, you will also need to install the cman package +and write a cluster.conf as per the documentation. + +GFS2 is not on-disk compatible with previous versions of GFS, but it +is pretty close. The following man pages can be found at the URL above: - gfs2_fsck to repair a filesystem + fsck.gfs2 to repair a filesystem gfs2_grow to expand a filesystem online gfs2_jadd to add journals to a filesystem online gfs2_tool to manipulate, examine and tune a filesystem gfs2_quota to examine and change quota values in a filesystem + gfs2_convert to convert a gfs filesystem to gfs2 in-place mount.gfs2 to help mount(8) mount a filesystem mkfs.gfs2 to make a filesystem -- cgit v1.2.3 From 5789ba3bd0a3cd20df5980ebf03358f2eb44fd67 Mon Sep 17 00:00:00 2001 From: Eric Paris Date: Thu, 21 May 2009 15:47:06 -0400 Subject: IMA: Minimal IMA policy and boot param for TCB IMA policy The IMA TCB policy is dangerous. A normal use can use all of a system's memory (which cannot be freed) simply by building and running lots of executables. The TCB policy is also nearly useless because logging in as root often causes a policy violation when dealing with utmp, thus rendering the measurements meaningless. There is no good fix for this in the kernel. A full TCB policy would need to be loaded in userspace using LSM rule matching to get both a protected and useful system. But, if too little is measured before userspace can load a real policy one again ends up with a meaningless set of measurements. One option would be to put the policy load inside the initrd in order to get it early enough in the boot sequence to be useful, but this runs into trouble with the LSM. For IMA to measure the LSM policy and the LSM policy loading mechanism it needs rules to do so, but we already talked about problems with defaulting to such broad rules.... IMA also depends on the files being measured to be on an FS which implements and supports i_version. Since the only FS with this support (ext4) doesn't even use it by default it seems silly to have any IMA rules by default. This should reduce the performance overhead of IMA to near 0 while still letting users who choose to configure their machine as such to inclue the ima_tcb kernel paramenter and get measurements during boot before they can load a customized, reasonable policy in userspace. Signed-off-by: Eric Paris Acked-by: Mimi Zohar Signed-off-by: James Morris --- Documentation/kernel-parameters.txt | 6 ++++++ security/integrity/ima/ima_policy.c | 30 +++++++++++++++++++++++++++--- 2 files changed, 33 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index e87bdbfbcc75..d9a24a04cfb1 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -914,6 +914,12 @@ and is between 256 and 4096 characters. It is defined in the file Formt: { "sha1" | "md5" } default: "sha1" + ima_tcb [IMA] + Load a policy which meets the needs of the Trusted + Computing Base. This means IMA will measure all + programs exec'd, files mmap'd for exec, and all files + opened for read by uid=0. + in2000= [HW,SCSI] See header of drivers/scsi/in2000.c. diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c index 31d677f7c65f..4719bbf1641a 100644 --- a/security/integrity/ima/ima_policy.c +++ b/security/integrity/ima/ima_policy.c @@ -45,9 +45,17 @@ struct ima_measure_rule_entry { } lsm[MAX_LSM_RULES]; }; -/* Without LSM specific knowledge, the default policy can only be +/* + * Without LSM specific knowledge, the default policy can only be * written in terms of .action, .func, .mask, .fsmagic, and .uid */ + +/* + * The minimum rule set to allow for full TCB coverage. Measures all files + * opened or mmap for exec and everything read by root. Dangerous because + * normal users can easily run the machine out of memory simply building + * and running executables. + */ static struct ima_measure_rule_entry default_rules[] = { {.action = DONT_MEASURE,.fsmagic = PROC_SUPER_MAGIC,.flags = IMA_FSMAGIC}, {.action = DONT_MEASURE,.fsmagic = SYSFS_MAGIC,.flags = IMA_FSMAGIC}, @@ -59,6 +67,8 @@ static struct ima_measure_rule_entry default_rules[] = { .flags = IMA_FUNC | IMA_MASK}, {.action = MEASURE,.func = BPRM_CHECK,.mask = MAY_EXEC, .flags = IMA_FUNC | IMA_MASK}, + {.action = MEASURE,.func = PATH_CHECK,.mask = MAY_READ,.uid = 0, + .flags = IMA_FUNC | IMA_MASK | IMA_UID}, }; static LIST_HEAD(measure_default_rules); @@ -67,6 +77,14 @@ static struct list_head *ima_measure; static DEFINE_MUTEX(ima_measure_mutex); +static bool ima_use_tcb __initdata; +static int __init default_policy_setup(char *str) +{ + ima_use_tcb = 1; + return 1; +} +__setup("ima_tcb", default_policy_setup); + /** * ima_match_rules - determine whether an inode matches the measure rule. * @rule: a pointer to a rule @@ -162,9 +180,15 @@ int ima_match_policy(struct inode *inode, enum ima_hooks func, int mask) */ void ima_init_policy(void) { - int i; + int i, entries; + + /* if !ima_use_tcb set entries = 0 so we load NO default rules */ + if (ima_use_tcb) + entries = ARRAY_SIZE(default_rules); + else + entries = 0; - for (i = 0; i < ARRAY_SIZE(default_rules); i++) + for (i = 0; i < entries; i++) list_add_tail(&default_rules[i].list, &measure_default_rules); ima_measure = &measure_default_rules; } -- cgit v1.2.3 From c72758f33784e5e2a1a4bb9421ef3e6de8f9fcf3 Mon Sep 17 00:00:00 2001 From: "Martin K. Petersen" Date: Fri, 22 May 2009 17:17:53 -0400 Subject: block: Export I/O topology for block devices and partitions To support devices with physical block sizes bigger than 512 bytes we need to ensure proper alignment. This patch adds support for exposing I/O topology characteristics as devices are stacked. logical_block_size is the smallest unit the device can address. physical_block_size indicates the smallest I/O the device can write without incurring a read-modify-write penalty. The io_min parameter is the smallest preferred I/O size reported by the device. In many cases this is the same as the physical block size. However, the io_min parameter can be scaled up when stacking (RAID5 chunk size > physical block size). The io_opt characteristic indicates the optimal I/O size reported by the device. This is usually the stripe width for arrays. The alignment_offset parameter indicates the number of bytes the start of the device/partition is offset from the device's natural alignment. Partition tools and MD/DM utilities can use this to pad their offsets so filesystems start on proper boundaries. Signed-off-by: Martin K. Petersen Signed-off-by: Jens Axboe --- Documentation/ABI/testing/sysfs-block | 59 +++++++++++ block/blk-settings.c | 186 ++++++++++++++++++++++++++++++++++ block/blk-sysfs.c | 33 ++++++ block/genhd.c | 11 ++ fs/partitions/check.c | 10 ++ include/linux/blkdev.h | 47 +++++++++ include/linux/genhd.h | 1 + 7 files changed, 347 insertions(+) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block index 44f52a4f5903..cbbd3e069945 100644 --- a/Documentation/ABI/testing/sysfs-block +++ b/Documentation/ABI/testing/sysfs-block @@ -60,3 +60,62 @@ Description: Indicates whether the block layer should automatically generate checksums for write requests bound for devices that support receiving integrity metadata. + +What: /sys/block//alignment_offset +Date: April 2009 +Contact: Martin K. Petersen +Description: + Storage devices may report a physical block size that is + bigger than the logical block size (for instance a drive + with 4KB physical sectors exposing 512-byte logical + blocks to the operating system). This parameter + indicates how many bytes the beginning of the device is + offset from the disk's natural alignment. + +What: /sys/block///alignment_offset +Date: April 2009 +Contact: Martin K. Petersen +Description: + Storage devices may report a physical block size that is + bigger than the logical block size (for instance a drive + with 4KB physical sectors exposing 512-byte logical + blocks to the operating system). This parameter + indicates how many bytes the beginning of the partition + is offset from the disk's natural alignment. + +What: /sys/block//queue/logical_block_size +Date: May 2009 +Contact: Martin K. Petersen +Description: + This is the smallest unit the storage device can + address. It is typically 512 bytes. + +What: /sys/block//queue/physical_block_size +Date: May 2009 +Contact: Martin K. Petersen +Description: + This is the smallest unit the storage device can write + without resorting to read-modify-write operation. It is + usually the same as the logical block size but may be + bigger. One example is SATA drives with 4KB sectors + that expose a 512-byte logical block size to the + operating system. + +What: /sys/block//queue/minimum_io_size +Date: April 2009 +Contact: Martin K. Petersen +Description: + Storage devices may report a preferred minimum I/O size, + which is the smallest request the device can perform + without incurring a read-modify-write penalty. For disk + drives this is often the physical block size. For RAID + arrays it is often the stripe chunk size. + +What: /sys/block//queue/optimal_io_size +Date: April 2009 +Contact: Martin K. Petersen +Description: + Storage devices may report an optimal I/O size, which is + the device's preferred unit of receiving I/O. This is + rarely reported for disk drives. For RAID devices it is + usually the stripe width or the internal block size. diff --git a/block/blk-settings.c b/block/blk-settings.c index b0f547cecfb8..5649f34adb40 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -309,9 +309,94 @@ EXPORT_SYMBOL(blk_queue_max_segment_size); void blk_queue_logical_block_size(struct request_queue *q, unsigned short size) { q->limits.logical_block_size = size; + + if (q->limits.physical_block_size < size) + q->limits.physical_block_size = size; + + if (q->limits.io_min < q->limits.physical_block_size) + q->limits.io_min = q->limits.physical_block_size; } EXPORT_SYMBOL(blk_queue_logical_block_size); +/** + * blk_queue_physical_block_size - set physical block size for the queue + * @q: the request queue for the device + * @size: the physical block size, in bytes + * + * Description: + * This should be set to the lowest possible sector size that the + * hardware can operate on without reverting to read-modify-write + * operations. + */ +void blk_queue_physical_block_size(struct request_queue *q, unsigned short size) +{ + q->limits.physical_block_size = size; + + if (q->limits.physical_block_size < q->limits.logical_block_size) + q->limits.physical_block_size = q->limits.logical_block_size; + + if (q->limits.io_min < q->limits.physical_block_size) + q->limits.io_min = q->limits.physical_block_size; +} +EXPORT_SYMBOL(blk_queue_physical_block_size); + +/** + * blk_queue_alignment_offset - set physical block alignment offset + * @q: the request queue for the device + * @alignment: alignment offset in bytes + * + * Description: + * Some devices are naturally misaligned to compensate for things like + * the legacy DOS partition table 63-sector offset. Low-level drivers + * should call this function for devices whose first sector is not + * naturally aligned. + */ +void blk_queue_alignment_offset(struct request_queue *q, unsigned int offset) +{ + q->limits.alignment_offset = + offset & (q->limits.physical_block_size - 1); + q->limits.misaligned = 0; +} +EXPORT_SYMBOL(blk_queue_alignment_offset); + +/** + * blk_queue_io_min - set minimum request size for the queue + * @q: the request queue for the device + * @io_min: smallest I/O size in bytes + * + * Description: + * Some devices have an internal block size bigger than the reported + * hardware sector size. This function can be used to signal the + * smallest I/O the device can perform without incurring a performance + * penalty. + */ +void blk_queue_io_min(struct request_queue *q, unsigned int min) +{ + q->limits.io_min = min; + + if (q->limits.io_min < q->limits.logical_block_size) + q->limits.io_min = q->limits.logical_block_size; + + if (q->limits.io_min < q->limits.physical_block_size) + q->limits.io_min = q->limits.physical_block_size; +} +EXPORT_SYMBOL(blk_queue_io_min); + +/** + * blk_queue_io_opt - set optimal request size for the queue + * @q: the request queue for the device + * @io_opt: optimal request size in bytes + * + * Description: + * Drivers can call this function to set the preferred I/O request + * size for devices that report such a value. + */ +void blk_queue_io_opt(struct request_queue *q, unsigned int opt) +{ + q->limits.io_opt = opt; +} +EXPORT_SYMBOL(blk_queue_io_opt); + /* * Returns the minimum that is _not_ zero, unless both are zero. */ @@ -357,6 +442,107 @@ void blk_queue_stack_limits(struct request_queue *t, struct request_queue *b) } EXPORT_SYMBOL(blk_queue_stack_limits); +/** + * blk_stack_limits - adjust queue_limits for stacked devices + * @t: the stacking driver limits (top) + * @bdev: the underlying queue limits (bottom) + * @offset: offset to beginning of data within component device + * + * Description: + * Merges two queue_limit structs. Returns 0 if alignment didn't + * change. Returns -1 if adding the bottom device caused + * misalignment. + */ +int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, + sector_t offset) +{ + t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors); + t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors); + + t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask, + b->seg_boundary_mask); + + t->max_phys_segments = min_not_zero(t->max_phys_segments, + b->max_phys_segments); + + t->max_hw_segments = min_not_zero(t->max_hw_segments, + b->max_hw_segments); + + t->max_segment_size = min_not_zero(t->max_segment_size, + b->max_segment_size); + + t->logical_block_size = max(t->logical_block_size, + b->logical_block_size); + + t->physical_block_size = max(t->physical_block_size, + b->physical_block_size); + + t->io_min = max(t->io_min, b->io_min); + t->no_cluster |= b->no_cluster; + + /* Bottom device offset aligned? */ + if (offset && + (offset & (b->physical_block_size - 1)) != b->alignment_offset) { + t->misaligned = 1; + return -1; + } + + /* If top has no alignment offset, inherit from bottom */ + if (!t->alignment_offset) + t->alignment_offset = + b->alignment_offset & (b->physical_block_size - 1); + + /* Top device aligned on logical block boundary? */ + if (t->alignment_offset & (t->logical_block_size - 1)) { + t->misaligned = 1; + return -1; + } + + return 0; +} + +/** + * disk_stack_limits - adjust queue limits for stacked drivers + * @t: MD/DM gendisk (top) + * @bdev: the underlying block device (bottom) + * @offset: offset to beginning of data within component device + * + * Description: + * Merges the limits for two queues. Returns 0 if alignment + * didn't change. Returns -1 if adding the bottom device caused + * misalignment. + */ +void disk_stack_limits(struct gendisk *disk, struct block_device *bdev, + sector_t offset) +{ + struct request_queue *t = disk->queue; + struct request_queue *b = bdev_get_queue(bdev); + + offset += get_start_sect(bdev) << 9; + + if (blk_stack_limits(&t->limits, &b->limits, offset) < 0) { + char top[BDEVNAME_SIZE], bottom[BDEVNAME_SIZE]; + + disk_name(disk, 0, top); + bdevname(bdev, bottom); + + printk(KERN_NOTICE "%s: Warning: Device %s is misaligned\n", + top, bottom); + } + + if (!t->queue_lock) + WARN_ON_ONCE(1); + else if (!test_bit(QUEUE_FLAG_CLUSTER, &b->queue_flags)) { + unsigned long flags; + + spin_lock_irqsave(t->queue_lock, flags); + if (!test_bit(QUEUE_FLAG_CLUSTER, &b->queue_flags)) + queue_flag_clear(QUEUE_FLAG_CLUSTER, t); + spin_unlock_irqrestore(t->queue_lock, flags); + } +} +EXPORT_SYMBOL(disk_stack_limits); + /** * blk_queue_dma_pad - set pad mask * @q: the request queue for the device diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 3ccdadb8e204..9337e17f9110 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -105,6 +105,21 @@ static ssize_t queue_logical_block_size_show(struct request_queue *q, char *page return queue_var_show(queue_logical_block_size(q), page); } +static ssize_t queue_physical_block_size_show(struct request_queue *q, char *page) +{ + return queue_var_show(queue_physical_block_size(q), page); +} + +static ssize_t queue_io_min_show(struct request_queue *q, char *page) +{ + return queue_var_show(queue_io_min(q), page); +} + +static ssize_t queue_io_opt_show(struct request_queue *q, char *page) +{ + return queue_var_show(queue_io_opt(q), page); +} + static ssize_t queue_max_sectors_store(struct request_queue *q, const char *page, size_t count) { @@ -257,6 +272,21 @@ static struct queue_sysfs_entry queue_logical_block_size_entry = { .show = queue_logical_block_size_show, }; +static struct queue_sysfs_entry queue_physical_block_size_entry = { + .attr = {.name = "physical_block_size", .mode = S_IRUGO }, + .show = queue_physical_block_size_show, +}; + +static struct queue_sysfs_entry queue_io_min_entry = { + .attr = {.name = "minimum_io_size", .mode = S_IRUGO }, + .show = queue_io_min_show, +}; + +static struct queue_sysfs_entry queue_io_opt_entry = { + .attr = {.name = "optimal_io_size", .mode = S_IRUGO }, + .show = queue_io_opt_show, +}; + static struct queue_sysfs_entry queue_nonrot_entry = { .attr = {.name = "rotational", .mode = S_IRUGO | S_IWUSR }, .show = queue_nonrot_show, @@ -289,6 +319,9 @@ static struct attribute *default_attrs[] = { &queue_iosched_entry.attr, &queue_hw_sector_size_entry.attr, &queue_logical_block_size_entry.attr, + &queue_physical_block_size_entry.attr, + &queue_io_min_entry.attr, + &queue_io_opt_entry.attr, &queue_nonrot_entry.attr, &queue_nomerges_entry.attr, &queue_rq_affinity_entry.attr, diff --git a/block/genhd.c b/block/genhd.c index 1a4916e01732..fe7ccc0a618f 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -852,11 +852,21 @@ static ssize_t disk_capability_show(struct device *dev, return sprintf(buf, "%x\n", disk->flags); } +static ssize_t disk_alignment_offset_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct gendisk *disk = dev_to_disk(dev); + + return sprintf(buf, "%d\n", queue_alignment_offset(disk->queue)); +} + static DEVICE_ATTR(range, S_IRUGO, disk_range_show, NULL); static DEVICE_ATTR(ext_range, S_IRUGO, disk_ext_range_show, NULL); static DEVICE_ATTR(removable, S_IRUGO, disk_removable_show, NULL); static DEVICE_ATTR(ro, S_IRUGO, disk_ro_show, NULL); static DEVICE_ATTR(size, S_IRUGO, part_size_show, NULL); +static DEVICE_ATTR(alignment_offset, S_IRUGO, disk_alignment_offset_show, NULL); static DEVICE_ATTR(capability, S_IRUGO, disk_capability_show, NULL); static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL); #ifdef CONFIG_FAIL_MAKE_REQUEST @@ -875,6 +885,7 @@ static struct attribute *disk_attrs[] = { &dev_attr_removable.attr, &dev_attr_ro.attr, &dev_attr_size.attr, + &dev_attr_alignment_offset.attr, &dev_attr_capability.attr, &dev_attr_stat.attr, #ifdef CONFIG_FAIL_MAKE_REQUEST diff --git a/fs/partitions/check.c b/fs/partitions/check.c index 99e33ef40be4..0af36085eb28 100644 --- a/fs/partitions/check.c +++ b/fs/partitions/check.c @@ -219,6 +219,13 @@ ssize_t part_size_show(struct device *dev, return sprintf(buf, "%llu\n",(unsigned long long)p->nr_sects); } +ssize_t part_alignment_offset_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct hd_struct *p = dev_to_part(dev); + return sprintf(buf, "%llu\n", (unsigned long long)p->alignment_offset); +} + ssize_t part_stat_show(struct device *dev, struct device_attribute *attr, char *buf) { @@ -272,6 +279,7 @@ ssize_t part_fail_store(struct device *dev, static DEVICE_ATTR(partition, S_IRUGO, part_partition_show, NULL); static DEVICE_ATTR(start, S_IRUGO, part_start_show, NULL); static DEVICE_ATTR(size, S_IRUGO, part_size_show, NULL); +static DEVICE_ATTR(alignment_offset, S_IRUGO, part_alignment_offset_show, NULL); static DEVICE_ATTR(stat, S_IRUGO, part_stat_show, NULL); #ifdef CONFIG_FAIL_MAKE_REQUEST static struct device_attribute dev_attr_fail = @@ -282,6 +290,7 @@ static struct attribute *part_attrs[] = { &dev_attr_partition.attr, &dev_attr_start.attr, &dev_attr_size.attr, + &dev_attr_alignment_offset.attr, &dev_attr_stat.attr, #ifdef CONFIG_FAIL_MAKE_REQUEST &dev_attr_fail.attr, @@ -383,6 +392,7 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno, pdev = part_to_dev(p); p->start_sect = start; + p->alignment_offset = queue_sector_alignment_offset(disk->queue, start); p->nr_sects = len; p->partno = partno; p->policy = get_disk_ro(disk); diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index b7bb6fdba12c..5e740a135e73 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -314,11 +314,16 @@ struct queue_limits { unsigned int max_hw_sectors; unsigned int max_sectors; unsigned int max_segment_size; + unsigned int physical_block_size; + unsigned int alignment_offset; + unsigned int io_min; + unsigned int io_opt; unsigned short logical_block_size; unsigned short max_hw_segments; unsigned short max_phys_segments; + unsigned char misaligned; unsigned char no_cluster; }; @@ -911,6 +916,15 @@ extern void blk_queue_max_phys_segments(struct request_queue *, unsigned short); extern void blk_queue_max_hw_segments(struct request_queue *, unsigned short); extern void blk_queue_max_segment_size(struct request_queue *, unsigned int); extern void blk_queue_logical_block_size(struct request_queue *, unsigned short); +extern void blk_queue_physical_block_size(struct request_queue *, unsigned short); +extern void blk_queue_alignment_offset(struct request_queue *q, + unsigned int alignment); +extern void blk_queue_io_min(struct request_queue *q, unsigned int min); +extern void blk_queue_io_opt(struct request_queue *q, unsigned int opt); +extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, + sector_t offset); +extern void disk_stack_limits(struct gendisk *disk, struct block_device *bdev, + sector_t offset); extern void blk_queue_stack_limits(struct request_queue *t, struct request_queue *b); extern void blk_queue_dma_pad(struct request_queue *, unsigned int); extern void blk_queue_update_dma_pad(struct request_queue *, unsigned int); @@ -1047,6 +1061,39 @@ static inline unsigned short bdev_logical_block_size(struct block_device *bdev) return queue_logical_block_size(bdev_get_queue(bdev)); } +static inline unsigned int queue_physical_block_size(struct request_queue *q) +{ + return q->limits.physical_block_size; +} + +static inline unsigned int queue_io_min(struct request_queue *q) +{ + return q->limits.io_min; +} + +static inline unsigned int queue_io_opt(struct request_queue *q) +{ + return q->limits.io_opt; +} + +static inline int queue_alignment_offset(struct request_queue *q) +{ + if (q && q->limits.misaligned) + return -1; + + if (q && q->limits.alignment_offset) + return q->limits.alignment_offset; + + return 0; +} + +static inline int queue_sector_alignment_offset(struct request_queue *q, + sector_t sector) +{ + return ((sector << 9) - q->limits.alignment_offset) + & (q->limits.io_min - 1); +} + static inline int queue_dma_alignment(struct request_queue *q) { return q ? q->dma_alignment : 511; diff --git a/include/linux/genhd.h b/include/linux/genhd.h index a1a28caed23d..149fda264c86 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -90,6 +90,7 @@ struct disk_stats { struct hd_struct { sector_t start_sect; sector_t nr_sects; + sector_t alignment_offset; struct device __dev; struct kobject *holder_dir; int policy, partno; -- cgit v1.2.3 From 29fcefba8a2f0fea11e2b721fe174a1832801284 Mon Sep 17 00:00:00 2001 From: Pekka Enberg Date: Sun, 24 May 2009 11:13:17 +0300 Subject: kmemtrace: fix kernel parameter documentation The kmemtrace.enable kernel parameter no longer works. To enable kmemtrace at boot-time, you must pass "ftrace=kmemtrace" instead. [ Impact: remove obsolete kernel parameter documentation ] Cc: Eduard - Gabriel Munteanu Signed-off-by: Pekka Enberg LKML-Reference: Signed-off-by: Frederic Weisbecker --- Documentation/kernel-parameters.txt | 10 ---------- 1 file changed, 10 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index e87bdbfbcc75..9243dd84f4d6 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -56,7 +56,6 @@ parameter is applicable: ISAPNP ISA PnP code is enabled. ISDN Appropriate ISDN support is enabled. JOY Appropriate joystick support is enabled. - KMEMTRACE kmemtrace is enabled. LIBATA Libata driver is enabled LP Printer support is enabled. LOOP Loopback device support is enabled. @@ -1054,15 +1053,6 @@ and is between 256 and 4096 characters. It is defined in the file use the HighMem zone if it exists, and the Normal zone if it does not. - kmemtrace.enable= [KNL,KMEMTRACE] Format: { yes | no } - Controls whether kmemtrace is enabled - at boot-time. - - kmemtrace.subbufs=n [KNL,KMEMTRACE] Overrides the number of - subbufs kmemtrace's relay channel has. Set this - higher than default (KMEMTRACE_N_SUBBUFS in code) if - you experience buffer overruns. - kgdboc= [HW] kgdb over consoles. Requires a tty driver that supports console polling. (only serial suported for now) -- cgit v1.2.3 From d9cfed925448f097ec7faab80d903eb7e5f99712 Mon Sep 17 00:00:00 2001 From: Joerg Roedel Date: Tue, 19 May 2009 12:16:29 +0200 Subject: amd-iommu: remove amd_iommu_size kernel parameter This parameter is not longer necessary when aperture increases dynamically. Signed-off-by: Joerg Roedel --- Documentation/kernel-parameters.txt | 5 ----- arch/x86/kernel/amd_iommu.c | 18 ++++-------------- arch/x86/kernel/amd_iommu_init.c | 15 --------------- 3 files changed, 4 insertions(+), 34 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index e87bdbfbcc75..5b776c6e7964 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -329,11 +329,6 @@ and is between 256 and 4096 characters. It is defined in the file flushed before they will be reused, which is a lot of faster - amd_iommu_size= [HW,X86-64] - Define the size of the aperture for the AMD IOMMU - driver. Possible values are: - '32M', '64M' (default), '128M', '256M', '512M', '1G' - amijoy.map= [HW,JOY] Amiga joystick support Map of devices attached to JOY0DAT and JOY1DAT Format: , diff --git a/arch/x86/kernel/amd_iommu.c b/arch/x86/kernel/amd_iommu.c index d129d8feba07..31d56c36010a 100644 --- a/arch/x86/kernel/amd_iommu.c +++ b/arch/x86/kernel/amd_iommu.c @@ -939,17 +939,10 @@ static void dma_ops_domain_free(struct dma_ops_domain *dom) * It also intializes the page table and the address allocator data * structures required for the dma_ops interface */ -static struct dma_ops_domain *dma_ops_domain_alloc(struct amd_iommu *iommu, - unsigned order) +static struct dma_ops_domain *dma_ops_domain_alloc(struct amd_iommu *iommu) { struct dma_ops_domain *dma_dom; - /* - * Currently the DMA aperture must be between 32 MB and 1GB in size - */ - if ((order < 25) || (order > 30)) - return NULL; - dma_dom = kzalloc(sizeof(struct dma_ops_domain), GFP_KERNEL); if (!dma_dom) return NULL; @@ -1087,7 +1080,6 @@ static int device_change_notifier(struct notifier_block *nb, struct protection_domain *domain; struct dma_ops_domain *dma_domain; struct amd_iommu *iommu; - int order = amd_iommu_aperture_order; unsigned long flags; if (devid > amd_iommu_last_bdf) @@ -1126,7 +1118,7 @@ static int device_change_notifier(struct notifier_block *nb, dma_domain = find_protection_domain(devid); if (dma_domain) goto out; - dma_domain = dma_ops_domain_alloc(iommu, order); + dma_domain = dma_ops_domain_alloc(iommu); if (!dma_domain) goto out; dma_domain->target_dev = devid; @@ -1826,7 +1818,6 @@ static void prealloc_protection_domains(void) struct pci_dev *dev = NULL; struct dma_ops_domain *dma_dom; struct amd_iommu *iommu; - int order = amd_iommu_aperture_order; u16 devid; while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) { @@ -1839,7 +1830,7 @@ static void prealloc_protection_domains(void) iommu = amd_iommu_rlookup_table[devid]; if (!iommu) continue; - dma_dom = dma_ops_domain_alloc(iommu, order); + dma_dom = dma_ops_domain_alloc(iommu); if (!dma_dom) continue; init_unity_mappings_for_device(dma_dom, devid); @@ -1865,7 +1856,6 @@ static struct dma_map_ops amd_iommu_dma_ops = { int __init amd_iommu_init_dma_ops(void) { struct amd_iommu *iommu; - int order = amd_iommu_aperture_order; int ret; /* @@ -1874,7 +1864,7 @@ int __init amd_iommu_init_dma_ops(void) * protection domain will be assigned to the default one. */ list_for_each_entry(iommu, &amd_iommu_list, list) { - iommu->default_dom = dma_ops_domain_alloc(iommu, order); + iommu->default_dom = dma_ops_domain_alloc(iommu); if (iommu->default_dom == NULL) return -ENOMEM; iommu->default_dom->domain.flags |= PD_DEFAULT_MASK; diff --git a/arch/x86/kernel/amd_iommu_init.c b/arch/x86/kernel/amd_iommu_init.c index 8c0be0902dac..762a4eefec93 100644 --- a/arch/x86/kernel/amd_iommu_init.c +++ b/arch/x86/kernel/amd_iommu_init.c @@ -121,7 +121,6 @@ u16 amd_iommu_last_bdf; /* largest PCI device id we have to handle */ LIST_HEAD(amd_iommu_unity_map); /* a list of required unity mappings we find in ACPI */ -unsigned amd_iommu_aperture_order = 26; /* size of aperture in power of 2 */ bool amd_iommu_isolate = true; /* if true, device isolation is enabled */ bool amd_iommu_unmap_flush; /* if true, flush on every unmap */ @@ -1137,9 +1136,6 @@ int __init amd_iommu_init(void) enable_iommus(); - printk(KERN_INFO "AMD IOMMU: aperture size is %d MB\n", - (1 << (amd_iommu_aperture_order-20))); - printk(KERN_INFO "AMD IOMMU: device isolation "); if (amd_iommu_isolate) printk("enabled\n"); @@ -1225,15 +1221,4 @@ static int __init parse_amd_iommu_options(char *str) return 1; } -static int __init parse_amd_iommu_size_options(char *str) -{ - unsigned order = PAGE_SHIFT + get_order(memparse(str, &str)); - - if ((order > 24) && (order < 31)) - amd_iommu_aperture_order = order; - - return 1; -} - __setup("amd_iommu=", parse_amd_iommu_options); -__setup("amd_iommu_size=", parse_amd_iommu_size_options); -- cgit v1.2.3 From 294ae4011530d008c59c4fb9847738e39228821e Mon Sep 17 00:00:00 2001 From: GeunSik Lim Date: Thu, 28 May 2009 10:36:11 +0900 Subject: ftrace: fix typo about map of kernel priority in ftrace.txt file. Fix typo about chart to map the kernel priority to user land priorities. * About sched_setscheduler(2) Processes scheduled under SCHED_FIFO or SCHED_RR can have a (user-space) static priority in the range 1 to 99. (reference: http://www.kernel.org/doc/man-pages/online/pages/ man2/sched_setscheduler.2.html) * From: Steven Rostedt 0 to 98 - maps to RT tasks 99 to 1 (SCHED_RR or SCHED_FIFO) 99 - maps to internal kernel threads that want to be lower than RT tasks but higher than SCHED_OTHER tasks. Although I'm not sure if any kernel thread actually uses this. I'm not even sure how this can be set, because the internal sched_setscheduler function does not allow for it. 100 to 139 - maps nice levels -20 to 19. These are not set via sched_setscheduler, but are set via the nice system call. 140 - reserved for idle tasks. Signed-off-by: GeunSik Lim Acked-by: Steven Rostedt Signed-off-by: Peter Zijlstra LKML-Reference: Signed-off-by: Ingo Molnar --- Documentation/trace/ftrace.txt | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt index fd9a3e693813..e362f50c496f 100644 --- a/Documentation/trace/ftrace.txt +++ b/Documentation/trace/ftrace.txt @@ -518,9 +518,18 @@ priority with zero (0) being the highest priority and the nice values starting at 100 (nice -20). Below is a quick chart to map the kernel priority to user land priorities. - Kernel priority: 0 to 99 ==> user RT priority 99 to 0 - Kernel priority: 100 to 139 ==> user nice -20 to 19 - Kernel priority: 140 ==> idle task priority + Kernel Space User Space + =============================================================== + 0(high) to 98(low) user RT priority 99(high) to 1(low) + with SCHED_RR or SCHED_FIFO + --------------------------------------------------------------- + 99 sched_priority is not used in scheduling + decisions(it must be specified as 0) + --------------------------------------------------------------- + 100(high) to 139(low) user nice -20(high) to 19(low) + --------------------------------------------------------------- + 140 idle task priority + --------------------------------------------------------------- The task states are: -- cgit v1.2.3 From f04d82b7e0c63d0251f9952a537a4bc4d73aa1a9 Mon Sep 17 00:00:00 2001 From: GeunSik Lim Date: Thu, 28 May 2009 10:36:14 +0900 Subject: sched: fix typo in sched-rt-group.txt file Fix typo about static priority's range. Kernel Space User Space =============================================================== 0(high) to 98(low) user RT priority 99(high) to 1(low) with SCHED_RR or SCHED_FIFO --------------------------------------------------------------- 99 sched_priority is not used in scheduling decisions(it must be specified as 0) --------------------------------------------------------------- 100(high) to 139(low) user nice -20(high) to 19(low) --------------------------------------------------------------- 140 idle task priority --------------------------------------------------------------- * ref) http://www.kernel.org/doc/man-pages/online/pages/man2/sched_setscheduler.2.html Signed-off-by: GeunSik Lim CC: Steven Rostedt Signed-off-by: Peter Zijlstra LKML-Reference: Signed-off-by: Ingo Molnar --- Documentation/scheduler/sched-rt-group.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/scheduler/sched-rt-group.txt b/Documentation/scheduler/sched-rt-group.txt index eb74b014a3f8..1df7f9cdab05 100644 --- a/Documentation/scheduler/sched-rt-group.txt +++ b/Documentation/scheduler/sched-rt-group.txt @@ -187,7 +187,7 @@ get their allocated time. Implementing SCHED_EDF might take a while to complete. Priority Inheritance is the biggest challenge as the current linux PI infrastructure is geared towards -the limited static priority levels 0-139. With deadline scheduling you need to +the limited static priority levels 0-99. With deadline scheduling you need to do deadline inheritance (since priority is inversely proportional to the deadline delta (deadline - now). -- cgit v1.2.3 From 2af15d6a44b871ad4c2a651302374cde8f335480 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Thu, 28 May 2009 13:37:24 -0400 Subject: ftrace: add kernel command line function filtering When using ftrace=function on the command line to trace functions on boot up, one can not filter out functions that are commonly called. This patch adds two new ftrace command line commands. ftrace_notrace=function-list ftrace_filter=function-list Where function-list is a comma separated list of functions to filter. The ftrace_notrace will make the functions listed not be included in the function tracing, and ftrace_filter will only trace the functions listed. These two act the same as the debugfs/tracing/set_ftrace_notrace and debugfs/tracing/set_ftrace_filter respectively. The simple glob expressions that are allowed by the filter files can also be used by the command line interface. ftrace_notrace=rcu*,*lock,*spin* Will not trace any function that starts with rcu, ends with lock, or has the word spin in it. Note, if the self tests are enabled, they may interfere with the filtering set by the command lines. Signed-off-by: Steven Rostedt --- Documentation/kernel-parameters.txt | 17 +++++++++++++-- kernel/trace/ftrace.c | 42 +++++++++++++++++++++++++++++++++++++ 2 files changed, 57 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 9243dd84f4d6..fcd3bfbe74e8 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -751,12 +751,25 @@ and is between 256 and 4096 characters. It is defined in the file ia64_pal_cache_flush instead of SAL_CACHE_FLUSH. ftrace=[tracer] - [ftrace] will set and start the specified tracer + [FTRACE] will set and start the specified tracer as early as possible in order to facilitate early boot debugging. ftrace_dump_on_oops - [ftrace] will dump the trace buffers on oops. + [FTRACE] will dump the trace buffers on oops. + + ftrace_filter=[function-list] + [FTRACE] Limit the functions traced by the function + tracer at boot up. function-list is a comma separated + list of functions. This list can be changed at run + time by the set_ftrace_filter file in the debugfs + tracing directory. + + ftrace_notrace=[function-list] + [FTRACE] Do not trace the functions specified in + function-list. This list can be changed at run time + by the set_ftrace_notrace file in the debugfs + tracing directory. gamecon.map[2|3]= [HW,JOY] Multisystem joystick and NES/SNES/PSX pad diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 140699a9a8a7..2074e5b7766b 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -32,6 +32,7 @@ #include #include +#include #include "trace_output.h" #include "trace_stat.h" @@ -2369,6 +2370,45 @@ void ftrace_set_notrace(unsigned char *buf, int len, int reset) ftrace_set_regex(buf, len, reset, 0); } +/* + * command line interface to allow users to set filters on boot up. + */ +#define FTRACE_FILTER_SIZE COMMAND_LINE_SIZE +static char ftrace_notrace_buf[FTRACE_FILTER_SIZE] __initdata; +static char ftrace_filter_buf[FTRACE_FILTER_SIZE] __initdata; + +static int __init set_ftrace_notrace(char *str) +{ + strncpy(ftrace_notrace_buf, str, FTRACE_FILTER_SIZE); + return 1; +} +__setup("ftrace_notrace=", set_ftrace_notrace); + +static int __init set_ftrace_filter(char *str) +{ + strncpy(ftrace_filter_buf, str, FTRACE_FILTER_SIZE); + return 1; +} +__setup("ftrace_filter=", set_ftrace_filter); + +static void __init set_ftrace_early_filter(char *buf, int enable) +{ + char *func; + + while (buf) { + func = strsep(&buf, ","); + ftrace_set_regex(func, strlen(func), 0, enable); + } +} + +static void __init set_ftrace_early_filters(void) +{ + if (ftrace_filter_buf[0]) + set_ftrace_early_filter(ftrace_filter_buf, 1); + if (ftrace_notrace_buf[0]) + set_ftrace_early_filter(ftrace_notrace_buf, 0); +} + static int ftrace_regex_release(struct inode *inode, struct file *file, int enable) { @@ -2829,6 +2869,8 @@ void __init ftrace_init(void) if (ret) pr_warning("Failed to register trace ftrace module notifier\n"); + set_ftrace_early_filters(); + return; failed: ftrace_disabled = 1; -- cgit v1.2.3 From 7fe063268e73681cdca1a6496a25f93d3332f517 Mon Sep 17 00:00:00 2001 From: Andrew Patterson Date: Tue, 2 Jun 2009 14:48:39 +0200 Subject: cciss: add cciss driver sysfs entries Add sysfs entries to the cciss driver needed for the dm/multipath tools. A file for vendor, model, rev, and unique_id is added for each logical drive under directory /sys/bus/pci/devices//ccissX/cXdY. Where X = the controller (or host) number and Y is the logical drive number. A link from /sys/bus/pci/devices//ccissX/cXdY/block:cciss!cXdY to /sys/block/cciss!cXdY/device is also created. A bus is created in /sys/bus/cciss. A link is created from the pci ccissX entry to /sys/bus/cciss/devices/ccissX. Please consider this for inclusion. Signed-off-by: Mike Miller Cc: Stephen M. Cameron Signed-off-by: Jens Axboe --- .../ABI/testing/sysfs-bus-pci-devices-cciss | 33 +++ drivers/block/cciss.c | 267 ++++++++++++++++++++- drivers/block/cciss.h | 24 +- 3 files changed, 314 insertions(+), 10 deletions(-) create mode 100644 Documentation/ABI/testing/sysfs-bus-pci-devices-cciss (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-cciss b/Documentation/ABI/testing/sysfs-bus-pci-devices-cciss new file mode 100644 index 000000000000..0a92a7c93a62 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-cciss @@ -0,0 +1,33 @@ +Where: /sys/bus/pci/devices//ccissX/cXdY/model +Date: March 2009 +Kernel Version: 2.6.30 +Contact: iss_storagedev@hp.com +Description: Displays the SCSI INQUIRY page 0 model for logical drive + Y of controller X. + +Where: /sys/bus/pci/devices//ccissX/cXdY/rev +Date: March 2009 +Kernel Version: 2.6.30 +Contact: iss_storagedev@hp.com +Description: Displays the SCSI INQUIRY page 0 revision for logical + drive Y of controller X. + +Where: /sys/bus/pci/devices//ccissX/cXdY/unique_id +Date: March 2009 +Kernel Version: 2.6.30 +Contact: iss_storagedev@hp.com +Description: Displays the SCSI INQUIRY page 83 serial number for logical + drive Y of controller X. + +Where: /sys/bus/pci/devices//ccissX/cXdY/vendor +Date: March 2009 +Kernel Version: 2.6.30 +Contact: iss_storagedev@hp.com +Description: Displays the SCSI INQUIRY page 0 vendor for logical drive + Y of controller X. + +Where: /sys/bus/pci/devices//ccissX/cXdY/block:cciss!cXdY +Date: March 2009 +Kernel Version: 2.6.30 +Contact: iss_storagedev@hp.com +Description: A symbolic link to /sys/block/cciss!cXdY diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c index cb43fb3af159..e7d00952dd4f 100644 --- a/drivers/block/cciss.c +++ b/drivers/block/cciss.c @@ -437,6 +437,194 @@ static void __devinit cciss_procinit(int i) } #endif /* CONFIG_PROC_FS */ +#define MAX_PRODUCT_NAME_LEN 19 + +#define to_hba(n) container_of(n, struct ctlr_info, dev) +#define to_drv(n) container_of(n, drive_info_struct, dev) + +static struct device_type cciss_host_type = { + .name = "cciss_host", +}; + +static ssize_t dev_show_unique_id(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + drive_info_struct *drv = to_drv(dev); + struct ctlr_info *h = to_hba(drv->dev.parent); + __u8 sn[16]; + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(CCISS_LOCK(h->ctlr), flags); + if (h->busy_configuring) + ret = -EBUSY; + else + memcpy(sn, drv->serial_no, sizeof(sn)); + spin_unlock_irqrestore(CCISS_LOCK(h->ctlr), flags); + + if (ret) + return ret; + else + return snprintf(buf, 16 * 2 + 2, + "%02X%02X%02X%02X%02X%02X%02X%02X" + "%02X%02X%02X%02X%02X%02X%02X%02X\n", + sn[0], sn[1], sn[2], sn[3], + sn[4], sn[5], sn[6], sn[7], + sn[8], sn[9], sn[10], sn[11], + sn[12], sn[13], sn[14], sn[15]); +} +DEVICE_ATTR(unique_id, S_IRUGO, dev_show_unique_id, NULL); + +static ssize_t dev_show_vendor(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + drive_info_struct *drv = to_drv(dev); + struct ctlr_info *h = to_hba(drv->dev.parent); + char vendor[VENDOR_LEN + 1]; + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(CCISS_LOCK(h->ctlr), flags); + if (h->busy_configuring) + ret = -EBUSY; + else + memcpy(vendor, drv->vendor, VENDOR_LEN + 1); + spin_unlock_irqrestore(CCISS_LOCK(h->ctlr), flags); + + if (ret) + return ret; + else + return snprintf(buf, sizeof(vendor) + 1, "%s\n", drv->vendor); +} +DEVICE_ATTR(vendor, S_IRUGO, dev_show_vendor, NULL); + +static ssize_t dev_show_model(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + drive_info_struct *drv = to_drv(dev); + struct ctlr_info *h = to_hba(drv->dev.parent); + char model[MODEL_LEN + 1]; + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(CCISS_LOCK(h->ctlr), flags); + if (h->busy_configuring) + ret = -EBUSY; + else + memcpy(model, drv->model, MODEL_LEN + 1); + spin_unlock_irqrestore(CCISS_LOCK(h->ctlr), flags); + + if (ret) + return ret; + else + return snprintf(buf, sizeof(model) + 1, "%s\n", drv->model); +} +DEVICE_ATTR(model, S_IRUGO, dev_show_model, NULL); + +static ssize_t dev_show_rev(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + drive_info_struct *drv = to_drv(dev); + struct ctlr_info *h = to_hba(drv->dev.parent); + char rev[REV_LEN + 1]; + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(CCISS_LOCK(h->ctlr), flags); + if (h->busy_configuring) + ret = -EBUSY; + else + memcpy(rev, drv->rev, REV_LEN + 1); + spin_unlock_irqrestore(CCISS_LOCK(h->ctlr), flags); + + if (ret) + return ret; + else + return snprintf(buf, sizeof(rev) + 1, "%s\n", drv->rev); +} +DEVICE_ATTR(rev, S_IRUGO, dev_show_rev, NULL); + +static struct attribute *cciss_dev_attrs[] = { + &dev_attr_unique_id.attr, + &dev_attr_model.attr, + &dev_attr_vendor.attr, + &dev_attr_rev.attr, + NULL +}; + +static struct attribute_group cciss_dev_attr_group = { + .attrs = cciss_dev_attrs, +}; + +static struct attribute_group *cciss_dev_attr_groups[] = { + &cciss_dev_attr_group, + NULL +}; + +static struct device_type cciss_dev_type = { + .name = "cciss_device", + .groups = cciss_dev_attr_groups, +}; + +static struct bus_type cciss_bus_type = { + .name = "cciss", +}; + + +/* + * Initialize sysfs entry for each controller. This sets up and registers + * the 'cciss#' directory for each individual controller under + * /sys/bus/pci/devices//. + */ +static int cciss_create_hba_sysfs_entry(struct ctlr_info *h) +{ + device_initialize(&h->dev); + h->dev.type = &cciss_host_type; + h->dev.bus = &cciss_bus_type; + dev_set_name(&h->dev, "%s", h->devname); + h->dev.parent = &h->pdev->dev; + + return device_add(&h->dev); +} + +/* + * Remove sysfs entries for an hba. + */ +static void cciss_destroy_hba_sysfs_entry(struct ctlr_info *h) +{ + device_del(&h->dev); +} + +/* + * Initialize sysfs for each logical drive. This sets up and registers + * the 'c#d#' directory for each individual logical drive under + * /sys/bus/pci/devices/dev); + drv->dev.type = &cciss_dev_type; + drv->dev.bus = &cciss_bus_type; + dev_set_name(&drv->dev, "c%dd%d", h->ctlr, drv_index); + drv->dev.parent = &h->dev; + return device_add(&drv->dev); +} + +/* + * Remove sysfs entries for a logical drive. + */ +static void cciss_destroy_ld_sysfs_entry(drive_info_struct *drv) +{ + device_del(&drv->dev); +} + /* * For operations that cannot sleep, a command block is allocated at init, * and managed by cmd_alloc() and cmd_free() using a simple bitmap to track @@ -1332,6 +1520,45 @@ static void cciss_softirq_done(struct request *rq) spin_unlock_irqrestore(&h->lock, flags); } +/* This function gets the SCSI vendor, model, and revision of a logical drive + * via the inquiry page 0. Model, vendor, and rev are set to empty strings if + * they cannot be read. + */ +static void cciss_get_device_descr(int ctlr, int logvol, int withirq, + char *vendor, char *model, char *rev) +{ + int rc; + InquiryData_struct *inq_buf; + + *vendor = '\0'; + *model = '\0'; + *rev = '\0'; + + inq_buf = kzalloc(sizeof(InquiryData_struct), GFP_KERNEL); + if (!inq_buf) + return; + + if (withirq) + rc = sendcmd_withirq(CISS_INQUIRY, ctlr, inq_buf, + sizeof(InquiryData_struct), 1, logvol, + 0, TYPE_CMD); + else + rc = sendcmd(CISS_INQUIRY, ctlr, inq_buf, + sizeof(InquiryData_struct), 1, logvol, 0, NULL, + TYPE_CMD); + if (rc == IO_OK) { + memcpy(vendor, &inq_buf->data_byte[8], VENDOR_LEN); + vendor[VENDOR_LEN] = '\0'; + memcpy(model, &inq_buf->data_byte[16], MODEL_LEN); + model[MODEL_LEN] = '\0'; + memcpy(rev, &inq_buf->data_byte[32], REV_LEN); + rev[REV_LEN] = '\0'; + } + + kfree(inq_buf); + return; +} + /* This function gets the serial number of a logical drive via * inquiry page 0x83. Serial no. is 16 bytes. If the serial * number cannot be had, for whatever reason, 16 bytes of 0xff @@ -1372,7 +1599,7 @@ static void cciss_add_disk(ctlr_info_t *h, struct gendisk *disk, disk->first_minor = drv_index << NWD_SHIFT; disk->fops = &cciss_fops; disk->private_data = &h->drv[drv_index]; - disk->driverfs_dev = &h->pdev->dev; + disk->driverfs_dev = &h->drv[drv_index].dev; /* Set up queue information */ blk_queue_bounce_limit(disk->queue, h->pdev->dma_mask); @@ -1463,6 +1690,8 @@ static void cciss_update_drive_info(int ctlr, int drv_index, int first_time) drvinfo->block_size = block_size; drvinfo->nr_blocks = total_size + 1; + cciss_get_device_descr(ctlr, drv_index, 1, drvinfo->vendor, + drvinfo->model, drvinfo->rev); cciss_get_serial_no(ctlr, drv_index, 1, drvinfo->serial_no, sizeof(drvinfo->serial_no)); @@ -1512,6 +1741,9 @@ static void cciss_update_drive_info(int ctlr, int drv_index, int first_time) h->drv[drv_index].cylinders = drvinfo->cylinders; h->drv[drv_index].raid_level = drvinfo->raid_level; memcpy(h->drv[drv_index].serial_no, drvinfo->serial_no, 16); + memcpy(h->drv[drv_index].vendor, drvinfo->vendor, VENDOR_LEN + 1); + memcpy(h->drv[drv_index].model, drvinfo->model, MODEL_LEN + 1); + memcpy(h->drv[drv_index].rev, drvinfo->rev, REV_LEN + 1); ++h->num_luns; disk = h->gendisk[drv_index]; @@ -1586,6 +1818,8 @@ static int cciss_add_gendisk(ctlr_info_t *h, __u32 lunid, int controller_node) } } h->drv[drv_index].LunID = lunid; + if (cciss_create_ld_sysfs_entry(h, &h->drv[drv_index], drv_index)) + goto err_free_disk; /* Don't need to mark this busy because nobody */ /* else knows about this disk yet to contend */ @@ -1593,6 +1827,11 @@ static int cciss_add_gendisk(ctlr_info_t *h, __u32 lunid, int controller_node) h->drv[drv_index].busy_configuring = 0; wmb(); return drv_index; + +err_free_disk: + put_disk(h->gendisk[drv_index]); + h->gendisk[drv_index] = NULL; + return -1; } /* This is for the special case of a controller which @@ -1713,6 +1952,7 @@ static int rebuild_lun_table(ctlr_info_t *h, int first_time) h->drv[i].busy_configuring = 1; spin_unlock_irqrestore(CCISS_LOCK(h->ctlr), flags); return_code = deregister_disk(h, i, 1); + cciss_destroy_ld_sysfs_entry(&h->drv[i]); h->drv[i].busy_configuring = 0; } } @@ -3719,12 +3959,15 @@ static int __devinit cciss_init_one(struct pci_dev *pdev, INIT_HLIST_HEAD(&hba[i]->reqQ); if (cciss_pci_init(hba[i], pdev) != 0) - goto clean1; + goto clean0; sprintf(hba[i]->devname, "cciss%d", i); hba[i]->ctlr = i; hba[i]->pdev = pdev; + if (cciss_create_hba_sysfs_entry(hba[i])) + goto clean0; + /* configure PCI DMA stuff */ if (!pci_set_dma_mask(pdev, DMA_BIT_MASK(64))) dac = 1; @@ -3868,6 +4111,8 @@ clean4: clean2: unregister_blkdev(hba[i]->major, hba[i]->devname); clean1: + cciss_destroy_hba_sysfs_entry(hba[i]); +clean0: hba[i]->busy_initializing = 0; /* cleanup any queues that may have been initialized */ for (j=0; j <= hba[i]->highest_lun; j++){ @@ -3978,6 +4223,7 @@ static void __devexit cciss_remove_one(struct pci_dev *pdev) */ pci_release_regions(pdev); pci_set_drvdata(pdev, NULL); + cciss_destroy_hba_sysfs_entry(hba[i]); free_hba(i); } @@ -3995,6 +4241,8 @@ static struct pci_driver cciss_pci_driver = { */ static int __init cciss_init(void) { + int err; + /* * The hardware requires that commands are aligned on a 64-bit * boundary. Given that we use pci_alloc_consistent() to allocate an @@ -4004,8 +4252,20 @@ static int __init cciss_init(void) printk(KERN_INFO DRIVER_NAME "\n"); + err = bus_register(&cciss_bus_type); + if (err) + return err; + /* Register for our PCI devices */ - return pci_register_driver(&cciss_pci_driver); + err = pci_register_driver(&cciss_pci_driver); + if (err) + goto err_bus_register; + + return 0; + +err_bus_register: + bus_unregister(&cciss_bus_type); + return err; } static void __exit cciss_cleanup(void) @@ -4022,6 +4282,7 @@ static void __exit cciss_cleanup(void) } } remove_proc_entry("driver/cciss", NULL); + bus_unregister(&cciss_bus_type); } static void fail_all_cmds(unsigned long ctlr) diff --git a/drivers/block/cciss.h b/drivers/block/cciss.h index 703e08038fb9..dd1926d8cd97 100644 --- a/drivers/block/cciss.h +++ b/drivers/block/cciss.h @@ -12,6 +12,10 @@ #define IO_OK 0 #define IO_ERROR 1 +#define VENDOR_LEN 8 +#define MODEL_LEN 16 +#define REV_LEN 4 + struct ctlr_info; typedef struct ctlr_info ctlr_info_t; @@ -34,13 +38,18 @@ typedef struct _drive_info_struct int cylinders; int raid_level; /* set to -1 to indicate that * the drive is not in use/configured - */ - int busy_configuring; /*This is set when the drive is being removed - *to prevent it from being opened or it's queue - *from being started. - */ - __u8 serial_no[16]; /* from inquiry page 0x83, */ - /* not necc. null terminated. */ + */ + int busy_configuring; /* This is set when a drive is being removed + * to prevent it from being opened or it's + * queue from being started. + */ + struct device dev; + __u8 serial_no[16]; /* from inquiry page 0x83, + * not necc. null terminated. + */ + char vendor[VENDOR_LEN + 1]; /* SCSI vendor string */ + char model[MODEL_LEN + 1]; /* SCSI model string */ + char rev[REV_LEN + 1]; /* SCSI revision string */ } drive_info_struct; #ifdef CONFIG_CISS_SCSI_TAPE @@ -123,6 +132,7 @@ struct ctlr_info unsigned char alive; struct completion *rescan_wait; struct task_struct *cciss_scan_thread; + struct device dev; }; /* Defining the diffent access_menthods */ -- cgit v1.2.3 From dbdc9dd342f0a7e32f40f0d4ade662bdfe057484 Mon Sep 17 00:00:00 2001 From: vibi sreenivasan Date: Tue, 2 Jun 2009 14:52:32 +0200 Subject: Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt File Documentation/PCI/PCI-DMA-mapping.txt does not exist. Documentation/DMA-mapping.txt contains DMA Mapping details Signed-off-by: vibi sreenivasan Signed-off-by: Jens Axboe --- Documentation/block/biodoc.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt index 6fab97ea7e6b..8d2158a1c6aa 100644 --- a/Documentation/block/biodoc.txt +++ b/Documentation/block/biodoc.txt @@ -186,7 +186,7 @@ a virtual address mapping (unlike the earlier scheme of virtual address do not have a corresponding kernel virtual address space mapping) and low-memory pages. -Note: Please refer to Documentation/PCI/PCI-DMA-mapping.txt for a discussion +Note: Please refer to Documentation/DMA-mapping.txt for a discussion on PCI high mem DMA aspects and mapping of scatter gather lists, and support for 64 bit PCI. -- cgit v1.2.3 From 1745de5e5639457513fe43440f2800e23c3cbc7d Mon Sep 17 00:00:00 2001 From: Joerg Roedel Date: Fri, 22 May 2009 21:49:51 +0200 Subject: dma-debug: add dma_debug_driver kernel command line This patch add the dma_debug_driver= boot parameter to enable the driver filter for early boot. Signed-off-by: Joerg Roedel --- Documentation/kernel-parameters.txt | 7 +++++++ lib/dma-debug.c | 18 ++++++++++++++++++ 2 files changed, 25 insertions(+) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index e87bdbfbcc75..b3f1314588c9 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -646,6 +646,13 @@ and is between 256 and 4096 characters. It is defined in the file DMA-API debugging code disables itself because the architectural default is too low. + dma_debug_driver= + With this option the DMA-API debugging driver + filter feature can be enabled at boot time. Just + pass the driver to filter for as the parameter. + The filter can be disabled or changed to another + driver later using sysfs. + dscc4.setup= [NET] dtc3181e= [HW,SCSI] diff --git a/lib/dma-debug.c b/lib/dma-debug.c index c6330b1a7be7..d0618aa13b49 100644 --- a/lib/dma-debug.c +++ b/lib/dma-debug.c @@ -1109,3 +1109,21 @@ void debug_dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, } EXPORT_SYMBOL(debug_dma_sync_sg_for_device); +static int __init dma_debug_driver_setup(char *str) +{ + int i; + + for (i = 0; i < NAME_MAX_LEN - 1; ++i, ++str) { + current_driver_name[i] = *str; + if (*str == 0) + break; + } + + if (current_driver_name[0]) + printk(KERN_INFO "DMA-API: enable driver filter for " + "driver [%s]\n", current_driver_name); + + + return 1; +} +__setup("dma_debug_driver=", dma_debug_driver_setup); -- cgit v1.2.3 From 016ea6874a6d58df85b54f56997d26df13c307b2 Mon Sep 17 00:00:00 2001 From: Joerg Roedel Date: Fri, 22 May 2009 21:57:23 +0200 Subject: dma-debug: add documentation for the driver filter This patch adds the driver filter feature to the dma-debug documentation. Signed-off-by: Joerg Roedel --- Documentation/DMA-API.txt | 12 ++++++++++++ 1 file changed, 12 insertions(+) (limited to 'Documentation') diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index d9aa43d78bcc..25fb8bcf32a2 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -704,12 +704,24 @@ this directory the following files can currently be found: The current number of free dma_debug_entries in the allocator. + dma-api/driver-filter + You can write a name of a driver into this file + to limit the debug output to requests from that + particular driver. Write an empty string to + that file to disable the filter and see + all errors again. + If you have this code compiled into your kernel it will be enabled by default. If you want to boot without the bookkeeping anyway you can provide 'dma_debug=off' as a boot parameter. This will disable DMA-API debugging. Notice that you can not enable it again at runtime. You have to reboot to do so. +If you want to see debug messages only for a special device driver you can +specify the dma_debug_driver= parameter. This will enable the +driver filter at boot time. The debug code will only print errors for that +driver afterwards. This filter can be disabled or changed later using debugfs. + When the code disables itself at runtime this is most likely because it ran out of dma_debug_entries. These entries are preallocated at boot. The number of preallocated entries is defined per architecture. If it is too low for you -- cgit v1.2.3 From bc5c6c043d8381676339fb3da59cc4cc5921d368 Mon Sep 17 00:00:00 2001 From: Mike Frysinger Date: Wed, 10 Jun 2009 04:48:41 -0400 Subject: ftrace/documentation: fix typo in function grapher name The function graph tracer is called just "function_graph" (no trailing "_tracer" needed). Signed-off-by: Mike Frysinger LKML-Reference: <1244623722-6325-1-git-send-email-vapier@gentoo.org> Signed-off-by: Steven Rostedt --- Documentation/trace/ftrace.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt index fd9a3e693813..5ad2ded8aa63 100644 --- a/Documentation/trace/ftrace.txt +++ b/Documentation/trace/ftrace.txt @@ -179,7 +179,7 @@ Here is the list of current tracers that may be configured. Function call tracer to trace all kernel functions. - "function_graph_tracer" + "function_graph" Similar to the function tracer except that the function tracer probes the functions on their entry -- cgit v1.2.3 From 04f70336c80c43a15e617b36c2043dfa0ad6ed0f Mon Sep 17 00:00:00 2001 From: Catalin Marinas Date: Thu, 11 Jun 2009 13:22:39 +0100 Subject: kmemleak: Add documentation on the memory leak detector This patch adds the Documentation/kmemleak.txt file with some information about how kmemleak works. Signed-off-by: Catalin Marinas --- Documentation/kernel-parameters.txt | 4 + Documentation/kmemleak.txt | 142 ++++++++++++++++++++++++++++++++++++ 2 files changed, 146 insertions(+) create mode 100644 Documentation/kmemleak.txt (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 4a3c2209a124..04a44cc5048a 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1077,6 +1077,10 @@ and is between 256 and 4096 characters. It is defined in the file Configure the RouterBoard 532 series on-chip Ethernet adapter MAC address. + kmemleak= [KNL] Boot-time kmemleak enable/disable + Valid arguments: on, off + Default: on + kstack=N [X86] Print N words from the kernel stack in oops dumps. diff --git a/Documentation/kmemleak.txt b/Documentation/kmemleak.txt new file mode 100644 index 000000000000..0112da3b9ab8 --- /dev/null +++ b/Documentation/kmemleak.txt @@ -0,0 +1,142 @@ +Kernel Memory Leak Detector +=========================== + +Introduction +------------ + +Kmemleak provides a way of detecting possible kernel memory leaks in a +way similar to a tracing garbage collector +(http://en.wikipedia.org/wiki/Garbage_collection_%28computer_science%29#Tracing_garbage_collectors), +with the difference that the orphan objects are not freed but only +reported via /sys/kernel/debug/kmemleak. A similar method is used by the +Valgrind tool (memcheck --leak-check) to detect the memory leaks in +user-space applications. + +Usage +----- + +CONFIG_DEBUG_KMEMLEAK in "Kernel hacking" has to be enabled. A kernel +thread scans the memory every 10 minutes (by default) and prints any new +unreferenced objects found. To trigger an intermediate scan and display +all the possible memory leaks: + + # mount -t debugfs nodev /sys/kernel/debug/ + # cat /sys/kernel/debug/kmemleak + +Note that the orphan objects are listed in the order they were allocated +and one object at the beginning of the list may cause other subsequent +objects to be reported as orphan. + +Memory scanning parameters can be modified at run-time by writing to the +/sys/kernel/debug/kmemleak file. The following parameters are supported: + + off - disable kmemleak (irreversible) + stack=on - enable the task stacks scanning + stack=off - disable the tasks stacks scanning + scan=on - start the automatic memory scanning thread + scan=off - stop the automatic memory scanning thread + scan= - set the automatic memory scanning period in seconds (0 + to disable it) + +Kmemleak can also be disabled at boot-time by passing "kmemleak=off" on +the kernel command line. + +Basic Algorithm +--------------- + +The memory allocations via kmalloc, vmalloc, kmem_cache_alloc and +friends are traced and the pointers, together with additional +information like size and stack trace, are stored in a prio search tree. +The corresponding freeing function calls are tracked and the pointers +removed from the kmemleak data structures. + +An allocated block of memory is considered orphan if no pointer to its +start address or to any location inside the block can be found by +scanning the memory (including saved registers). This means that there +might be no way for the kernel to pass the address of the allocated +block to a freeing function and therefore the block is considered a +memory leak. + +The scanning algorithm steps: + + 1. mark all objects as white (remaining white objects will later be + considered orphan) + 2. scan the memory starting with the data section and stacks, checking + the values against the addresses stored in the prio search tree. If + a pointer to a white object is found, the object is added to the + gray list + 3. scan the gray objects for matching addresses (some white objects + can become gray and added at the end of the gray list) until the + gray set is finished + 4. the remaining white objects are considered orphan and reported via + /sys/kernel/debug/kmemleak + +Some allocated memory blocks have pointers stored in the kernel's +internal data structures and they cannot be detected as orphans. To +avoid this, kmemleak can also store the number of values pointing to an +address inside the block address range that need to be found so that the +block is not considered a leak. One example is __vmalloc(). + +Kmemleak API +------------ + +See the include/linux/kmemleak.h header for the functions prototype. + +kmemleak_init - initialize kmemleak +kmemleak_alloc - notify of a memory block allocation +kmemleak_free - notify of a memory block freeing +kmemleak_not_leak - mark an object as not a leak +kmemleak_ignore - do not scan or report an object as leak +kmemleak_scan_area - add scan areas inside a memory block +kmemleak_no_scan - do not scan a memory block +kmemleak_erase - erase an old value in a pointer variable +kmemleak_alloc_recursive - as kmemleak_alloc but checks the recursiveness +kmemleak_free_recursive - as kmemleak_free but checks the recursiveness + +Dealing with false positives/negatives +-------------------------------------- + +The false negatives are real memory leaks (orphan objects) but not +reported by kmemleak because values found during the memory scanning +point to such objects. To reduce the number of false negatives, kmemleak +provides the kmemleak_ignore, kmemleak_scan_area, kmemleak_no_scan and +kmemleak_erase functions (see above). The task stacks also increase the +amount of false negatives and their scanning is not enabled by default. + +The false positives are objects wrongly reported as being memory leaks +(orphan). For objects known not to be leaks, kmemleak provides the +kmemleak_not_leak function. The kmemleak_ignore could also be used if +the memory block is known not to contain other pointers and it will no +longer be scanned. + +Some of the reported leaks are only transient, especially on SMP +systems, because of pointers temporarily stored in CPU registers or +stacks. Kmemleak defines MSECS_MIN_AGE (defaulting to 1000) representing +the minimum age of an object to be reported as a memory leak. + +Limitations and Drawbacks +------------------------- + +The main drawback is the reduced performance of memory allocation and +freeing. To avoid other penalties, the memory scanning is only performed +when the /sys/kernel/debug/kmemleak file is read. Anyway, this tool is +intended for debugging purposes where the performance might not be the +most important requirement. + +To keep the algorithm simple, kmemleak scans for values pointing to any +address inside a block's address range. This may lead to an increased +number of false negatives. However, it is likely that a real memory leak +will eventually become visible. + +Another source of false negatives is the data stored in non-pointer +values. In a future version, kmemleak could only scan the pointer +members in the allocated structures. This feature would solve many of +the false negative cases described above. + +The tool can report false positives. These are cases where an allocated +block doesn't need to be freed (some cases in the init_call functions), +the pointer is calculated by other methods than the usual container_of +macro or the pointer is stored in a location not scanned by kmemleak. + +Page allocations and ioremap are not tracked. Only the ARM and x86 +architectures are currently supported. -- cgit v1.2.3