From e5def4c6039e9b7bb158c6ec6d8fedda9265b7ef Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:15 +0800 Subject: Documentation: add Linux x86 docs to Sphinx TOC tree Add a index.rst for x86 support. More docs will be added later. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/index.rst | 9 +++++++++ 1 file changed, 9 insertions(+) create mode 100644 Documentation/x86/index.rst (limited to 'Documentation/x86') diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst new file mode 100644 index 000000000000..9f34545a9c52 --- /dev/null +++ b/Documentation/x86/index.rst @@ -0,0 +1,9 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== +x86-specific Documentation +========================== + +.. toctree:: + :maxdepth: 2 + :numbered: -- cgit v1.2.3 From f1f238a9f1ca1f1c7f53ebb7d5517066bf40c471 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:16 +0800 Subject: Documentation: x86: convert boot.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Cc: Mauro Carvalho Chehab Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/boot.rst | 1256 +++++++++++++++++++++++++++++++++++++++++++ Documentation/x86/boot.txt | 1134 -------------------------------------- Documentation/x86/index.rst | 2 + 3 files changed, 1258 insertions(+), 1134 deletions(-) create mode 100644 Documentation/x86/boot.rst delete mode 100644 Documentation/x86/boot.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/boot.rst b/Documentation/x86/boot.rst new file mode 100644 index 000000000000..08a2f100c0e6 --- /dev/null +++ b/Documentation/x86/boot.rst @@ -0,0 +1,1256 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +The Linux/x86 Boot Protocol +=========================== + +On the x86 platform, the Linux kernel uses a rather complicated boot +convention. This has evolved partially due to historical aspects, as +well as the desire in the early days to have the kernel itself be a +bootable image, the complicated PC memory model and due to changed +expectations in the PC industry caused by the effective demise of +real-mode DOS as a mainstream operating system. + +Currently, the following versions of the Linux/x86 boot protocol exist. + +============= ============================================================ +Old kernels zImage/Image support only. Some very early kernels + may not even support a command line. + +Protocol 2.00 (Kernel 1.3.73) Added bzImage and initrd support, as + well as a formalized way to communicate between the + boot loader and the kernel. setup.S made relocatable, + although the traditional setup area still assumed + writable. + +Protocol 2.01 (Kernel 1.3.76) Added a heap overrun warning. + +Protocol 2.02 (Kernel 2.4.0-test3-pre3) New command line protocol. + Lower the conventional memory ceiling. No overwrite + of the traditional setup area, thus making booting + safe for systems which use the EBDA from SMM or 32-bit + BIOS entry points. zImage deprecated but still + supported. + +Protocol 2.03 (Kernel 2.4.18-pre1) Explicitly makes the highest possible + initrd address available to the bootloader. + +Protocol 2.04 (Kernel 2.6.14) Extend the syssize field to four bytes. + +Protocol 2.05 (Kernel 2.6.20) Make protected mode kernel relocatable. + Introduce relocatable_kernel and kernel_alignment fields. + +Protocol 2.06 (Kernel 2.6.22) Added a field that contains the size of + the boot command line. + +Protocol 2.07 (Kernel 2.6.24) Added paravirtualised boot protocol. + Introduced hardware_subarch and hardware_subarch_data + and KEEP_SEGMENTS flag in load_flags. + +Protocol 2.08 (Kernel 2.6.26) Added crc32 checksum and ELF format + payload. Introduced payload_offset and payload_length + fields to aid in locating the payload. + +Protocol 2.09 (Kernel 2.6.26) Added a field of 64-bit physical + pointer to single linked list of struct setup_data. + +Protocol 2.10 (Kernel 2.6.31) Added a protocol for relaxed alignment + beyond the kernel_alignment added, new init_size and + pref_address fields. Added extended boot loader IDs. + +Protocol 2.11 (Kernel 3.6) Added a field for offset of EFI handover + protocol entry point. + +Protocol 2.12 (Kernel 3.8) Added the xloadflags field and extension fields + to struct boot_params for loading bzImage and ramdisk + above 4G in 64bit. + +Protocol 2.13 (Kernel 3.14) Support 32- and 64-bit flags being set in + xloadflags to support booting a 64-bit kernel from 32-bit + EFI +============= ============================================================ + + +Memory Layout +============= + +The traditional memory map for the kernel loader, used for Image or +zImage kernels, typically looks like:: + + | | + 0A0000 +------------------------+ + | Reserved for BIOS | Do not use. Reserved for BIOS EBDA. + 09A000 +------------------------+ + | Command line | + | Stack/heap | For use by the kernel real-mode code. + 098000 +------------------------+ + | Kernel setup | The kernel real-mode code. + 090200 +------------------------+ + | Kernel boot sector | The kernel legacy boot sector. + 090000 +------------------------+ + | Protected-mode kernel | The bulk of the kernel image. + 010000 +------------------------+ + | Boot loader | <- Boot sector entry point 0000:7C00 + 001000 +------------------------+ + | Reserved for MBR/BIOS | + 000800 +------------------------+ + | Typically used by MBR | + 000600 +------------------------+ + | BIOS use only | + 000000 +------------------------+ + +When using bzImage, the protected-mode kernel was relocated to +0x100000 ("high memory"), and the kernel real-mode block (boot sector, +setup, and stack/heap) was made relocatable to any address between +0x10000 and end of low memory. Unfortunately, in protocols 2.00 and +2.01 the 0x90000+ memory range is still used internally by the kernel; +the 2.02 protocol resolves that problem. + +It is desirable to keep the "memory ceiling" -- the highest point in +low memory touched by the boot loader -- as low as possible, since +some newer BIOSes have begun to allocate some rather large amounts of +memory, called the Extended BIOS Data Area, near the top of low +memory. The boot loader should use the "INT 12h" BIOS call to verify +how much low memory is available. + +Unfortunately, if INT 12h reports that the amount of memory is too +low, there is usually nothing the boot loader can do but to report an +error to the user. The boot loader should therefore be designed to +take up as little space in low memory as it reasonably can. For +zImage or old bzImage kernels, which need data written into the +0x90000 segment, the boot loader should make sure not to use memory +above the 0x9A000 point; too many BIOSes will break above that point. + +For a modern bzImage kernel with boot protocol version >= 2.02, a +memory layout like the following is suggested:: + + ~ ~ + | Protected-mode kernel | + 100000 +------------------------+ + | I/O memory hole | + 0A0000 +------------------------+ + | Reserved for BIOS | Leave as much as possible unused + ~ ~ + | Command line | (Can also be below the X+10000 mark) + X+10000 +------------------------+ + | Stack/heap | For use by the kernel real-mode code. + X+08000 +------------------------+ + | Kernel setup | The kernel real-mode code. + | Kernel boot sector | The kernel legacy boot sector. + X +------------------------+ + | Boot loader | <- Boot sector entry point 0000:7C00 + 001000 +------------------------+ + | Reserved for MBR/BIOS | + 000800 +------------------------+ + | Typically used by MBR | + 000600 +------------------------+ + | BIOS use only | + 000000 +------------------------+ + + ... where the address X is as low as the design of the boot loader permits. + + +The Real-Mode Kernel Header +=========================== + +In the following text, and anywhere in the kernel boot sequence, "a +sector" refers to 512 bytes. It is independent of the actual sector +size of the underlying medium. + +The first step in loading a Linux kernel should be to load the +real-mode code (boot sector and setup code) and then examine the +following header at offset 0x01f1. The real-mode code can total up to +32K, although the boot loader may choose to load only the first two +sectors (1K) and then examine the bootup sector size. + +The header looks like: + +=========== ======== ===================== ============================================ +Offset/Size Proto Name Meaning +=========== ======== ===================== ============================================ +01F1/1 ALL(1) setup_sects The size of the setup in sectors +01F2/2 ALL root_flags If set, the root is mounted readonly +01F4/4 2.04+(2) syssize The size of the 32-bit code in 16-byte paras +01F8/2 ALL ram_size DO NOT USE - for bootsect.S use only +01FA/2 ALL vid_mode Video mode control +01FC/2 ALL root_dev Default root device number +01FE/2 ALL boot_flag 0xAA55 magic number +0200/2 2.00+ jump Jump instruction +0202/4 2.00+ header Magic signature "HdrS" +0206/2 2.00+ version Boot protocol version supported +0208/4 2.00+ realmode_swtch Boot loader hook (see below) +020C/2 2.00+ start_sys_seg The load-low segment (0x1000) (obsolete) +020E/2 2.00+ kernel_version Pointer to kernel version string +0210/1 2.00+ type_of_loader Boot loader identifier +0211/1 2.00+ loadflags Boot protocol option flags +0212/2 2.00+ setup_move_size Move to high memory size (used with hooks) +0214/4 2.00+ code32_start Boot loader hook (see below) +0218/4 2.00+ ramdisk_image initrd load address (set by boot loader) +021C/4 2.00+ ramdisk_size initrd size (set by boot loader) +0220/4 2.00+ bootsect_kludge DO NOT USE - for bootsect.S use only +0224/2 2.01+ heap_end_ptr Free memory after setup end +0226/1 2.02+(3) ext_loader_ver Extended boot loader version +0227/1 2.02+(3) ext_loader_type Extended boot loader ID +0228/4 2.02+ cmd_line_ptr 32-bit pointer to the kernel command line +022C/4 2.03+ initrd_addr_max Highest legal initrd address +0230/4 2.05+ kernel_alignment Physical addr alignment required for kernel +0234/1 2.05+ relocatable_kernel Whether kernel is relocatable or not +0235/1 2.10+ min_alignment Minimum alignment, as a power of two +0236/2 2.12+ xloadflags Boot protocol option flags +0238/4 2.06+ cmdline_size Maximum size of the kernel command line +023C/4 2.07+ hardware_subarch Hardware subarchitecture +0240/8 2.07+ hardware_subarch_data Subarchitecture-specific data +0248/4 2.08+ payload_offset Offset of kernel payload +024C/4 2.08+ payload_length Length of kernel payload +0250/8 2.09+ setup_data 64-bit physical pointer to linked list + of struct setup_data +0258/8 2.10+ pref_address Preferred loading address +0260/4 2.10+ init_size Linear memory required during initialization +0264/4 2.11+ handover_offset Offset of handover entry point +=========== ======== ===================== ============================================ + +.. note:: + (1) For backwards compatibility, if the setup_sects field contains 0, the + real value is 4. + + (2) For boot protocol prior to 2.04, the upper two bytes of the syssize + field are unusable, which means the size of a bzImage kernel + cannot be determined. + + (3) Ignored, but safe to set, for boot protocols 2.02-2.09. + +If the "HdrS" (0x53726448) magic number is not found at offset 0x202, +the boot protocol version is "old". Loading an old kernel, the +following parameters should be assumed:: + + Image type = zImage + initrd not supported + Real-mode kernel must be located at 0x90000. + +Otherwise, the "version" field contains the protocol version, +e.g. protocol version 2.01 will contain 0x0201 in this field. When +setting fields in the header, you must make sure only to set fields +supported by the protocol version in use. + + +Details of Harder Fileds +======================== + +For each field, some are information from the kernel to the bootloader +("read"), some are expected to be filled out by the bootloader +("write"), and some are expected to be read and modified by the +bootloader ("modify"). + +All general purpose boot loaders should write the fields marked +(obligatory). Boot loaders who want to load the kernel at a +nonstandard address should fill in the fields marked (reloc); other +boot loaders can ignore those fields. + +The byte order of all fields is littleendian (this is x86, after all.) + +============ =========== +Field name: setup_sects +Type: read +Offset/size: 0x1f1/1 +Protocol: ALL +============ =========== + + The size of the setup code in 512-byte sectors. If this field is + 0, the real value is 4. The real-mode code consists of the boot + sector (always one 512-byte sector) plus the setup code. + +============ ================= +Field name: root_flags +Type: modify (optional) +Offset/size: 0x1f2/2 +Protocol: ALL +============ ================= + + If this field is nonzero, the root defaults to readonly. The use of + this field is deprecated; use the "ro" or "rw" options on the + command line instead. + +============ =============================================== +Field name: syssize +Type: read +Offset/size: 0x1f4/4 (protocol 2.04+) 0x1f4/2 (protocol ALL) +Protocol: 2.04+ +============ =============================================== + + The size of the protected-mode code in units of 16-byte paragraphs. + For protocol versions older than 2.04 this field is only two bytes + wide, and therefore cannot be trusted for the size of a kernel if + the LOAD_HIGH flag is set. + +============ =============== +Field name: ram_size +Type: kernel internal +Offset/size: 0x1f8/2 +Protocol: ALL +============ =============== + + This field is obsolete. + +============ =================== +Field name: vid_mode +Type: modify (obligatory) +Offset/size: 0x1fa/2 +============ =================== + + Please see the section on SPECIAL COMMAND LINE OPTIONS. + +============ ================= +Field name: root_dev +Type: modify (optional) +Offset/size: 0x1fc/2 +Protocol: ALL +============ ================= + + The default root device device number. The use of this field is + deprecated, use the "root=" option on the command line instead. + +============ ========= +Field name: boot_flag +Type: read +Offset/size: 0x1fe/2 +Protocol: ALL +============ ========= + + Contains 0xAA55. This is the closest thing old Linux kernels have + to a magic number. + +============ ======= +Field name: jump +Type: read +Offset/size: 0x200/2 +Protocol: 2.00+ +============ ======= + + Contains an x86 jump instruction, 0xEB followed by a signed offset + relative to byte 0x202. This can be used to determine the size of + the header. + +============ ======= +Field name: header +Type: read +Offset/size: 0x202/4 +Protocol: 2.00+ +============ ======= + + Contains the magic number "HdrS" (0x53726448). + +============ ======= +Field name: version +Type: read +Offset/size: 0x206/2 +Protocol: 2.00+ +============ ======= + + Contains the boot protocol version, in (major << 8)+minor format, + e.g. 0x0204 for version 2.04, and 0x0a11 for a hypothetical version + 10.17. + +============ ================= +Field name: realmode_swtch +Type: modify (optional) +Offset/size: 0x208/4 +Protocol: 2.00+ +============ ================= + + Boot loader hook (see ADVANCED BOOT LOADER HOOKS below.) + +============ ============= +Field name: start_sys_seg +Type: read +Offset/size: 0x20c/2 +Protocol: 2.00+ +============ ============= + + The load low segment (0x1000). Obsolete. + +============ ============== +Field name: kernel_version +Type: read +Offset/size: 0x20e/2 +Protocol: 2.00+ +============ ============== + + If set to a nonzero value, contains a pointer to a NUL-terminated + human-readable kernel version number string, less 0x200. This can + be used to display the kernel version to the user. This value + should be less than (0x200*setup_sects). + + For example, if this value is set to 0x1c00, the kernel version + number string can be found at offset 0x1e00 in the kernel file. + This is a valid value if and only if the "setup_sects" field + contains the value 15 or higher, as:: + + 0x1c00 < 15*0x200 (= 0x1e00) but + 0x1c00 >= 14*0x200 (= 0x1c00) + + 0x1c00 >> 9 = 14, So the minimum value for setup_secs is 15. + +============ ================== +Field name: type_of_loader +Type: write (obligatory) +Offset/size: 0x210/1 +Protocol: 2.00+ +============ ================== + + If your boot loader has an assigned id (see table below), enter + 0xTV here, where T is an identifier for the boot loader and V is + a version number. Otherwise, enter 0xFF here. + + For boot loader IDs above T = 0xD, write T = 0xE to this field and + write the extended ID minus 0x10 to the ext_loader_type field. + Similarly, the ext_loader_ver field can be used to provide more than + four bits for the bootloader version. + + For example, for T = 0x15, V = 0x234, write:: + + type_of_loader <- 0xE4 + ext_loader_type <- 0x05 + ext_loader_ver <- 0x23 + + Assigned boot loader ids (hexadecimal): + + == ======================================= + 0 LILO + (0x00 reserved for pre-2.00 bootloader) + 1 Loadlin + 2 bootsect-loader + (0x20, all other values reserved) + 3 Syslinux + 4 Etherboot/gPXE/iPXE + 5 ELILO + 7 GRUB + 8 U-Boot + 9 Xen + A Gujin + B Qemu + C Arcturus Networks uCbootloader + D kexec-tools + E Extended (see ext_loader_type) + F Special (0xFF = undefined) + 10 Reserved + 11 Minimal Linux Bootloader + + 12 OVMF UEFI virtualization stack + == ======================================= + + Please contact if you need a bootloader ID value assigned. + +============ =================== +Field name: loadflags +Type: modify (obligatory) +Offset/size: 0x211/1 +Protocol: 2.00+ +============ =================== + + This field is a bitmask. + + Bit 0 (read): LOADED_HIGH + + - If 0, the protected-mode code is loaded at 0x10000. + - If 1, the protected-mode code is loaded at 0x100000. + + Bit 1 (kernel internal): KASLR_FLAG + + - Used internally by the compressed kernel to communicate + KASLR status to kernel proper. + + - If 1, KASLR enabled. + - If 0, KASLR disabled. + + Bit 5 (write): QUIET_FLAG + + - If 0, print early messages. + - If 1, suppress early messages. + + This requests to the kernel (decompressor and early + kernel) to not write early messages that require + accessing the display hardware directly. + + Bit 6 (write): KEEP_SEGMENTS + + Protocol: 2.07+ + + - If 0, reload the segment registers in the 32bit entry point. + - If 1, do not reload the segment registers in the 32bit entry point. + + Assume that %cs %ds %ss %es are all set to flat segments with + a base of 0 (or the equivalent for their environment). + + Bit 7 (write): CAN_USE_HEAP + + Set this bit to 1 to indicate that the value entered in the + heap_end_ptr is valid. If this field is clear, some setup code + functionality will be disabled. + + +============ =================== +Field name: setup_move_size +Type: modify (obligatory) +Offset/size: 0x212/2 +Protocol: 2.00-2.01 +============ =================== + + When using protocol 2.00 or 2.01, if the real mode kernel is not + loaded at 0x90000, it gets moved there later in the loading + sequence. Fill in this field if you want additional data (such as + the kernel command line) moved in addition to the real-mode kernel + itself. + + The unit is bytes starting with the beginning of the boot sector. + + This field is can be ignored when the protocol is 2.02 or higher, or + if the real-mode code is loaded at 0x90000. + +============ ======================== +Field name: code32_start +Type: modify (optional, reloc) +Offset/size: 0x214/4 +Protocol: 2.00+ +============ ======================== + + The address to jump to in protected mode. This defaults to the load + address of the kernel, and can be used by the boot loader to + determine the proper load address. + + This field can be modified for two purposes: + + 1. as a boot loader hook (see Advanced Boot Loader Hooks below.) + + 2. if a bootloader which does not install a hook loads a + relocatable kernel at a nonstandard address it will have to modify + this field to point to the load address. + +============ ================== +Field name: ramdisk_image +Type: write (obligatory) +Offset/size: 0x218/4 +Protocol: 2.00+ +============ ================== + + The 32-bit linear address of the initial ramdisk or ramfs. Leave at + zero if there is no initial ramdisk/ramfs. + +============ ================== +Field name: ramdisk_size +Type: write (obligatory) +Offset/size: 0x21c/4 +Protocol: 2.00+ +============ ================== + + Size of the initial ramdisk or ramfs. Leave at zero if there is no + initial ramdisk/ramfs. + +============ =============== +Field name: bootsect_kludge +Type: kernel internal +Offset/size: 0x220/4 +Protocol: 2.00+ +============ =============== + + This field is obsolete. + +============ ================== +Field name: heap_end_ptr +Type: write (obligatory) +Offset/size: 0x224/2 +Protocol: 2.01+ +============ ================== + + Set this field to the offset (from the beginning of the real-mode + code) of the end of the setup stack/heap, minus 0x0200. + +============ ================ +Field name: ext_loader_ver +Type: write (optional) +Offset/size: 0x226/1 +Protocol: 2.02+ +============ ================ + + This field is used as an extension of the version number in the + type_of_loader field. The total version number is considered to be + (type_of_loader & 0x0f) + (ext_loader_ver << 4). + + The use of this field is boot loader specific. If not written, it + is zero. + + Kernels prior to 2.6.31 did not recognize this field, but it is safe + to write for protocol version 2.02 or higher. + +============ ===================================================== +Field name: ext_loader_type +Type: write (obligatory if (type_of_loader & 0xf0) == 0xe0) +Offset/size: 0x227/1 +Protocol: 2.02+ +============ ===================================================== + + This field is used as an extension of the type number in + type_of_loader field. If the type in type_of_loader is 0xE, then + the actual type is (ext_loader_type + 0x10). + + This field is ignored if the type in type_of_loader is not 0xE. + + Kernels prior to 2.6.31 did not recognize this field, but it is safe + to write for protocol version 2.02 or higher. + +============ ================== +Field name: cmd_line_ptr +Type: write (obligatory) +Offset/size: 0x228/4 +Protocol: 2.02+ +============ ================== + + Set this field to the linear address of the kernel command line. + The kernel command line can be located anywhere between the end of + the setup heap and 0xA0000; it does not have to be located in the + same 64K segment as the real-mode code itself. + + Fill in this field even if your boot loader does not support a + command line, in which case you can point this to an empty string + (or better yet, to the string "auto".) If this field is left at + zero, the kernel will assume that your boot loader does not support + the 2.02+ protocol. + +============ =============== +Field name: initrd_addr_max +Type: read +Offset/size: 0x22c/4 +Protocol: 2.03+ +============ =============== + + The maximum address that may be occupied by the initial + ramdisk/ramfs contents. For boot protocols 2.02 or earlier, this + field is not present, and the maximum address is 0x37FFFFFF. (This + address is defined as the address of the highest safe byte, so if + your ramdisk is exactly 131072 bytes long and this field is + 0x37FFFFFF, you can start your ramdisk at 0x37FE0000.) + +============ ============================ +Field name: kernel_alignment +Type: read/modify (reloc) +Offset/size: 0x230/4 +Protocol: 2.05+ (read), 2.10+ (modify) +============ ============================ + + Alignment unit required by the kernel (if relocatable_kernel is + true.) A relocatable kernel that is loaded at an alignment + incompatible with the value in this field will be realigned during + kernel initialization. + + Starting with protocol version 2.10, this reflects the kernel + alignment preferred for optimal performance; it is possible for the + loader to modify this field to permit a lesser alignment. See the + min_alignment and pref_address field below. + +============ ================== +Field name: relocatable_kernel +Type: read (reloc) +Offset/size: 0x234/1 +Protocol: 2.05+ +============ ================== + + If this field is nonzero, the protected-mode part of the kernel can + be loaded at any address that satisfies the kernel_alignment field. + After loading, the boot loader must set the code32_start field to + point to the loaded code, or to a boot loader hook. + +============ ============= +Field name: min_alignment +Type: read (reloc) +Offset/size: 0x235/1 +Protocol: 2.10+ +============ ============= + + This field, if nonzero, indicates as a power of two the minimum + alignment required, as opposed to preferred, by the kernel to boot. + If a boot loader makes use of this field, it should update the + kernel_alignment field with the alignment unit desired; typically:: + + kernel_alignment = 1 << min_alignment + + There may be a considerable performance cost with an excessively + misaligned kernel. Therefore, a loader should typically try each + power-of-two alignment from kernel_alignment down to this alignment. + +============ ========== +Field name: xloadflags +Type: read +Offset/size: 0x236/2 +Protocol: 2.12+ +============ ========== + + This field is a bitmask. + + Bit 0 (read): XLF_KERNEL_64 + + - If 1, this kernel has the legacy 64-bit entry point at 0x200. + + Bit 1 (read): XLF_CAN_BE_LOADED_ABOVE_4G + + - If 1, kernel/boot_params/cmdline/ramdisk can be above 4G. + + Bit 2 (read): XLF_EFI_HANDOVER_32 + + - If 1, the kernel supports the 32-bit EFI handoff entry point + given at handover_offset. + + Bit 3 (read): XLF_EFI_HANDOVER_64 + + - If 1, the kernel supports the 64-bit EFI handoff entry point + given at handover_offset + 0x200. + + Bit 4 (read): XLF_EFI_KEXEC + + - If 1, the kernel supports kexec EFI boot with EFI runtime support. + + +============ ============ +Field name: cmdline_size +Type: read +Offset/size: 0x238/4 +Protocol: 2.06+ +============ ============ + + The maximum size of the command line without the terminating + zero. This means that the command line can contain at most + cmdline_size characters. With protocol version 2.05 and earlier, the + maximum size was 255. + +============ ==================================== +Field name: hardware_subarch +Type: write (optional, defaults to x86/PC) +Offset/size: 0x23c/4 +Protocol: 2.07+ +============ ==================================== + + In a paravirtualized environment the hardware low level architectural + pieces such as interrupt handling, page table handling, and + accessing process control registers needs to be done differently. + + This field allows the bootloader to inform the kernel we are in one + one of those environments. + + ========== ============================== + 0x00000000 The default x86/PC environment + 0x00000001 lguest + 0x00000002 Xen + 0x00000003 Moorestown MID + 0x00000004 CE4100 TV Platform + ========== ============================== + +============ ========================= +Field name: hardware_subarch_data +Type: write (subarch-dependent) +Offset/size: 0x240/8 +Protocol: 2.07+ +============ ========================= + + A pointer to data that is specific to hardware subarch + This field is currently unused for the default x86/PC environment, + do not modify. + +============ ============== +Field name: payload_offset +Type: read +Offset/size: 0x248/4 +Protocol: 2.08+ +============ ============== + + If non-zero then this field contains the offset from the beginning + of the protected-mode code to the payload. + + The payload may be compressed. The format of both the compressed and + uncompressed data should be determined using the standard magic + numbers. The currently supported compression formats are gzip + (magic numbers 1F 8B or 1F 9E), bzip2 (magic number 42 5A), LZMA + (magic number 5D 00), XZ (magic number FD 37), and LZ4 (magic number + 02 21). The uncompressed payload is currently always ELF (magic + number 7F 45 4C 46). + +============ ============== +Field name: payload_length +Type: read +Offset/size: 0x24c/4 +Protocol: 2.08+ +============ ============== + + The length of the payload. + +============ =============== +Field name: setup_data +Type: write (special) +Offset/size: 0x250/8 +Protocol: 2.09+ +============ =============== + + The 64-bit physical pointer to NULL terminated single linked list of + struct setup_data. This is used to define a more extensible boot + parameters passing mechanism. The definition of struct setup_data is + as follow:: + + struct setup_data { + u64 next; + u32 type; + u32 len; + u8 data[0]; + }; + + Where, the next is a 64-bit physical pointer to the next node of + linked list, the next field of the last node is 0; the type is used + to identify the contents of data; the len is the length of data + field; the data holds the real payload. + + This list may be modified at a number of points during the bootup + process. Therefore, when modifying this list one should always make + sure to consider the case where the linked list already contains + entries. + +============ ============ +Field name: pref_address +Type: read (reloc) +Offset/size: 0x258/8 +Protocol: 2.10+ +============ ============ + + This field, if nonzero, represents a preferred load address for the + kernel. A relocating bootloader should attempt to load at this + address if possible. + + A non-relocatable kernel will unconditionally move itself and to run + at this address. + +============ ======= +Field name: init_size +Type: read +Offset/size: 0x260/4 +============ ======= + + This field indicates the amount of linear contiguous memory starting + at the kernel runtime start address that the kernel needs before it + is capable of examining its memory map. This is not the same thing + as the total amount of memory the kernel needs to boot, but it can + be used by a relocating boot loader to help select a safe load + address for the kernel. + + The kernel runtime start address is determined by the following algorithm:: + + if (relocatable_kernel) + runtime_start = align_up(load_address, kernel_alignment) + else + runtime_start = pref_address + +============ =============== +Field name: handover_offset +Type: read +Offset/size: 0x264/4 +============ =============== + + This field is the offset from the beginning of the kernel image to + the EFI handover protocol entry point. Boot loaders using the EFI + handover protocol to boot the kernel should jump to this offset. + + See EFI HANDOVER PROTOCOL below for more details. + + +The Image Checksum +================== + +From boot protocol version 2.08 onwards the CRC-32 is calculated over +the entire file using the characteristic polynomial 0x04C11DB7 and an +initial remainder of 0xffffffff. The checksum is appended to the +file; therefore the CRC of the file up to the limit specified in the +syssize field of the header is always 0. + + +The Kernel Command Line +======================= + +The kernel command line has become an important way for the boot +loader to communicate with the kernel. Some of its options are also +relevant to the boot loader itself, see "special command line options" +below. + +The kernel command line is a null-terminated string. The maximum +length can be retrieved from the field cmdline_size. Before protocol +version 2.06, the maximum was 255 characters. A string that is too +long will be automatically truncated by the kernel. + +If the boot protocol version is 2.02 or later, the address of the +kernel command line is given by the header field cmd_line_ptr (see +above.) This address can be anywhere between the end of the setup +heap and 0xA0000. + +If the protocol version is *not* 2.02 or higher, the kernel +command line is entered using the following protocol: + + - At offset 0x0020 (word), "cmd_line_magic", enter the magic + number 0xA33F. + + - At offset 0x0022 (word), "cmd_line_offset", enter the offset + of the kernel command line (relative to the start of the + real-mode kernel). + + - The kernel command line *must* be within the memory region + covered by setup_move_size, so you may need to adjust this + field. + + +Memory Layout of The Real-Mode Code +=================================== + +The real-mode code requires a stack/heap to be set up, as well as +memory allocated for the kernel command line. This needs to be done +in the real-mode accessible memory in bottom megabyte. + +It should be noted that modern machines often have a sizable Extended +BIOS Data Area (EBDA). As a result, it is advisable to use as little +of the low megabyte as possible. + +Unfortunately, under the following circumstances the 0x90000 memory +segment has to be used: + + - When loading a zImage kernel ((loadflags & 0x01) == 0). + - When loading a 2.01 or earlier boot protocol kernel. + +.. note:: + For the 2.00 and 2.01 boot protocols, the real-mode code + can be loaded at another address, but it is internally + relocated to 0x90000. For the "old" protocol, the + real-mode code must be loaded at 0x90000. + +When loading at 0x90000, avoid using memory above 0x9a000. + +For boot protocol 2.02 or higher, the command line does not have to be +located in the same 64K segment as the real-mode setup code; it is +thus permitted to give the stack/heap the full 64K segment and locate +the command line above it. + +The kernel command line should not be located below the real-mode +code, nor should it be located in high memory. + + +Sample Boot Configuartion +========================= + +As a sample configuration, assume the following layout of the real +mode segment. + + When loading below 0x90000, use the entire segment: + + ============= =================== + 0x0000-0x7fff Real mode kernel + 0x8000-0xdfff Stack and heap + 0xe000-0xffff Kernel command line + ============= =================== + + When loading at 0x90000 OR the protocol version is 2.01 or earlier: + + ============= =================== + 0x0000-0x7fff Real mode kernel + 0x8000-0x97ff Stack and heap + 0x9800-0x9fff Kernel command line + ============= =================== + +Such a boot loader should enter the following fields in the header:: + + unsigned long base_ptr; /* base address for real-mode segment */ + + if ( setup_sects == 0 ) { + setup_sects = 4; + } + + if ( protocol >= 0x0200 ) { + type_of_loader = ; + if ( loading_initrd ) { + ramdisk_image = ; + ramdisk_size = ; + } + + if ( protocol >= 0x0202 && loadflags & 0x01 ) + heap_end = 0xe000; + else + heap_end = 0x9800; + + if ( protocol >= 0x0201 ) { + heap_end_ptr = heap_end - 0x200; + loadflags |= 0x80; /* CAN_USE_HEAP */ + } + + if ( protocol >= 0x0202 ) { + cmd_line_ptr = base_ptr + heap_end; + strcpy(cmd_line_ptr, cmdline); + } else { + cmd_line_magic = 0xA33F; + cmd_line_offset = heap_end; + setup_move_size = heap_end + strlen(cmdline)+1; + strcpy(base_ptr+cmd_line_offset, cmdline); + } + } else { + /* Very old kernel */ + + heap_end = 0x9800; + + cmd_line_magic = 0xA33F; + cmd_line_offset = heap_end; + + /* A very old kernel MUST have its real-mode code + loaded at 0x90000 */ + + if ( base_ptr != 0x90000 ) { + /* Copy the real-mode kernel */ + memcpy(0x90000, base_ptr, (setup_sects+1)*512); + base_ptr = 0x90000; /* Relocated */ + } + + strcpy(0x90000+cmd_line_offset, cmdline); + + /* It is recommended to clear memory up to the 32K mark */ + memset(0x90000 + (setup_sects+1)*512, 0, + (64-(setup_sects+1))*512); + } + + +Loading The Rest of The Kernel +============================== + +The 32-bit (non-real-mode) kernel starts at offset (setup_sects+1)*512 +in the kernel file (again, if setup_sects == 0 the real value is 4.) +It should be loaded at address 0x10000 for Image/zImage kernels and +0x100000 for bzImage kernels. + +The kernel is a bzImage kernel if the protocol >= 2.00 and the 0x01 +bit (LOAD_HIGH) in the loadflags field is set:: + + is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01); + load_address = is_bzImage ? 0x100000 : 0x10000; + +Note that Image/zImage kernels can be up to 512K in size, and thus use +the entire 0x10000-0x90000 range of memory. This means it is pretty +much a requirement for these kernels to load the real-mode part at +0x90000. bzImage kernels allow much more flexibility. + +Special Command Line Options +============================ + +If the command line provided by the boot loader is entered by the +user, the user may expect the following command line options to work. +They should normally not be deleted from the kernel command line even +though not all of them are actually meaningful to the kernel. Boot +loader authors who need additional command line options for the boot +loader itself should get them registered in +Documentation/admin-guide/kernel-parameters.rst to make sure they will not +conflict with actual kernel options now or in the future. + + vga= + here is either an integer (in C notation, either + decimal, octal, or hexadecimal) or one of the strings + "normal" (meaning 0xFFFF), "ext" (meaning 0xFFFE) or "ask" + (meaning 0xFFFD). This value should be entered into the + vid_mode field, as it is used by the kernel before the command + line is parsed. + + mem= + is an integer in C notation optionally followed by + (case insensitive) K, M, G, T, P or E (meaning << 10, << 20, + << 30, << 40, << 50 or << 60). This specifies the end of + memory to the kernel. This affects the possible placement of + an initrd, since an initrd should be placed near end of + memory. Note that this is an option to *both* the kernel and + the bootloader! + + initrd= + An initrd should be loaded. The meaning of is + obviously bootloader-dependent, and some boot loaders + (e.g. LILO) do not have such a command. + +In addition, some boot loaders add the following options to the +user-specified command line: + + BOOT_IMAGE= + The boot image which was loaded. Again, the meaning of + is obviously bootloader-dependent. + + auto + The kernel was booted without explicit user intervention. + +If these options are added by the boot loader, it is highly +recommended that they are located *first*, before the user-specified +or configuration-specified command line. Otherwise, "init=/bin/sh" +gets confused by the "auto" option. + + +Running the Kernel +================== + +The kernel is started by jumping to the kernel entry point, which is +located at *segment* offset 0x20 from the start of the real mode +kernel. This means that if you loaded your real-mode kernel code at +0x90000, the kernel entry point is 9020:0000. + +At entry, ds = es = ss should point to the start of the real-mode +kernel code (0x9000 if the code is loaded at 0x90000), sp should be +set up properly, normally pointing to the top of the heap, and +interrupts should be disabled. Furthermore, to guard against bugs in +the kernel, it is recommended that the boot loader sets fs = gs = ds = +es = ss. + +In our example from above, we would do:: + + /* Note: in the case of the "old" kernel protocol, base_ptr must + be == 0x90000 at this point; see the previous sample code */ + + seg = base_ptr >> 4; + + cli(); /* Enter with interrupts disabled! */ + + /* Set up the real-mode kernel stack */ + _SS = seg; + _SP = heap_end; + + _DS = _ES = _FS = _GS = seg; + jmp_far(seg+0x20, 0); /* Run the kernel */ + +If your boot sector accesses a floppy drive, it is recommended to +switch off the floppy motor before running the kernel, since the +kernel boot leaves interrupts off and thus the motor will not be +switched off, especially if the loaded kernel has the floppy driver as +a demand-loaded module! + + +Advanced Boot Loader Hooks +========================== + +If the boot loader runs in a particularly hostile environment (such as +LOADLIN, which runs under DOS) it may be impossible to follow the +standard memory location requirements. Such a boot loader may use the +following hooks that, if set, are invoked by the kernel at the +appropriate time. The use of these hooks should probably be +considered an absolutely last resort! + +IMPORTANT: All the hooks are required to preserve %esp, %ebp, %esi and +%edi across invocation. + + realmode_swtch: + A 16-bit real mode far subroutine invoked immediately before + entering protected mode. The default routine disables NMI, so + your routine should probably do so, too. + + code32_start: + A 32-bit flat-mode routine *jumped* to immediately after the + transition to protected mode, but before the kernel is + uncompressed. No segments, except CS, are guaranteed to be + set up (current kernels do, but older ones do not); you should + set them up to BOOT_DS (0x18) yourself. + + After completing your hook, you should jump to the address + that was in this field before your boot loader overwrote it + (relocated, if appropriate.) + + +32-bit Boot Protocol +==================== + +For machine with some new BIOS other than legacy BIOS, such as EFI, +LinuxBIOS, etc, and kexec, the 16-bit real mode setup code in kernel +based on legacy BIOS can not be used, so a 32-bit boot protocol needs +to be defined. + +In 32-bit boot protocol, the first step in loading a Linux kernel +should be to setup the boot parameters (struct boot_params, +traditionally known as "zero page"). The memory for struct boot_params +should be allocated and initialized to all zero. Then the setup header +from offset 0x01f1 of kernel image on should be loaded into struct +boot_params and examined. The end of setup header can be calculated as +follow:: + + 0x0202 + byte value at offset 0x0201 + +In addition to read/modify/write the setup header of the struct +boot_params as that of 16-bit boot protocol, the boot loader should +also fill the additional fields of the struct boot_params as that +described in zero-page.txt. + +After setting up the struct boot_params, the boot loader can load the +32/64-bit kernel in the same way as that of 16-bit boot protocol. + +In 32-bit boot protocol, the kernel is started by jumping to the +32-bit kernel entry point, which is the start address of loaded +32/64-bit kernel. + +At entry, the CPU must be in 32-bit protected mode with paging +disabled; a GDT must be loaded with the descriptors for selectors +__BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat +segment; __BOOT_CS must have execute/read permission, and __BOOT_DS +must have read/write permission; CS must be __BOOT_CS and DS, ES, SS +must be __BOOT_DS; interrupt must be disabled; %esi must hold the base +address of the struct boot_params; %ebp, %edi and %ebx must be zero. + +64-bit Boot Protocol +==================== + +For machine with 64bit cpus and 64bit kernel, we could use 64bit bootloader +and we need a 64-bit boot protocol. + +In 64-bit boot protocol, the first step in loading a Linux kernel +should be to setup the boot parameters (struct boot_params, +traditionally known as "zero page"). The memory for struct boot_params +could be allocated anywhere (even above 4G) and initialized to all zero. +Then, the setup header at offset 0x01f1 of kernel image on should be +loaded into struct boot_params and examined. The end of setup header +can be calculated as follows:: + + 0x0202 + byte value at offset 0x0201 + +In addition to read/modify/write the setup header of the struct +boot_params as that of 16-bit boot protocol, the boot loader should +also fill the additional fields of the struct boot_params as described +in zero-page.txt. + +After setting up the struct boot_params, the boot loader can load +64-bit kernel in the same way as that of 16-bit boot protocol, but +kernel could be loaded above 4G. + +In 64-bit boot protocol, the kernel is started by jumping to the +64-bit kernel entry point, which is the start address of loaded +64-bit kernel plus 0x200. + +At entry, the CPU must be in 64-bit mode with paging enabled. +The range with setup_header.init_size from start address of loaded +kernel and zero page and command line buffer get ident mapping; +a GDT must be loaded with the descriptors for selectors +__BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat +segment; __BOOT_CS must have execute/read permission, and __BOOT_DS +must have read/write permission; CS must be __BOOT_CS and DS, ES, SS +must be __BOOT_DS; interrupt must be disabled; %rsi must hold the base +address of the struct boot_params. + +EFI Handover Protocol +===================== + +This protocol allows boot loaders to defer initialisation to the EFI +boot stub. The boot loader is required to load the kernel/initrd(s) +from the boot media and jump to the EFI handover protocol entry point +which is hdr->handover_offset bytes from the beginning of +startup_{32,64}. + +The function prototype for the handover entry point looks like this:: + + efi_main(void *handle, efi_system_table_t *table, struct boot_params *bp) + +'handle' is the EFI image handle passed to the boot loader by the EFI +firmware, 'table' is the EFI system table - these are the first two +arguments of the "handoff state" as described in section 2.3 of the +UEFI specification. 'bp' is the boot loader-allocated boot params. + +The boot loader *must* fill out the following fields in bp:: + + - hdr.code32_start + - hdr.cmd_line_ptr + - hdr.ramdisk_image (if applicable) + - hdr.ramdisk_size (if applicable) + +All other fields should be zero. diff --git a/Documentation/x86/boot.txt b/Documentation/x86/boot.txt deleted file mode 100644 index 223e484a1304..000000000000 --- a/Documentation/x86/boot.txt +++ /dev/null @@ -1,1134 +0,0 @@ - THE LINUX/x86 BOOT PROTOCOL - --------------------------- - -On the x86 platform, the Linux kernel uses a rather complicated boot -convention. This has evolved partially due to historical aspects, as -well as the desire in the early days to have the kernel itself be a -bootable image, the complicated PC memory model and due to changed -expectations in the PC industry caused by the effective demise of -real-mode DOS as a mainstream operating system. - -Currently, the following versions of the Linux/x86 boot protocol exist. - -Old kernels: zImage/Image support only. Some very early kernels - may not even support a command line. - -Protocol 2.00: (Kernel 1.3.73) Added bzImage and initrd support, as - well as a formalized way to communicate between the - boot loader and the kernel. setup.S made relocatable, - although the traditional setup area still assumed - writable. - -Protocol 2.01: (Kernel 1.3.76) Added a heap overrun warning. - -Protocol 2.02: (Kernel 2.4.0-test3-pre3) New command line protocol. - Lower the conventional memory ceiling. No overwrite - of the traditional setup area, thus making booting - safe for systems which use the EBDA from SMM or 32-bit - BIOS entry points. zImage deprecated but still - supported. - -Protocol 2.03: (Kernel 2.4.18-pre1) Explicitly makes the highest possible - initrd address available to the bootloader. - -Protocol 2.04: (Kernel 2.6.14) Extend the syssize field to four bytes. - -Protocol 2.05: (Kernel 2.6.20) Make protected mode kernel relocatable. - Introduce relocatable_kernel and kernel_alignment fields. - -Protocol 2.06: (Kernel 2.6.22) Added a field that contains the size of - the boot command line. - -Protocol 2.07: (Kernel 2.6.24) Added paravirtualised boot protocol. - Introduced hardware_subarch and hardware_subarch_data - and KEEP_SEGMENTS flag in load_flags. - -Protocol 2.08: (Kernel 2.6.26) Added crc32 checksum and ELF format - payload. Introduced payload_offset and payload_length - fields to aid in locating the payload. - -Protocol 2.09: (Kernel 2.6.26) Added a field of 64-bit physical - pointer to single linked list of struct setup_data. - -Protocol 2.10: (Kernel 2.6.31) Added a protocol for relaxed alignment - beyond the kernel_alignment added, new init_size and - pref_address fields. Added extended boot loader IDs. - -Protocol 2.11: (Kernel 3.6) Added a field for offset of EFI handover - protocol entry point. - -Protocol 2.12: (Kernel 3.8) Added the xloadflags field and extension fields - to struct boot_params for loading bzImage and ramdisk - above 4G in 64bit. - -Protocol 2.13: (Kernel 3.14) Support 32- and 64-bit flags being set in - xloadflags to support booting a 64-bit kernel from 32-bit - EFI - -**** MEMORY LAYOUT - -The traditional memory map for the kernel loader, used for Image or -zImage kernels, typically looks like: - - | | -0A0000 +------------------------+ - | Reserved for BIOS | Do not use. Reserved for BIOS EBDA. -09A000 +------------------------+ - | Command line | - | Stack/heap | For use by the kernel real-mode code. -098000 +------------------------+ - | Kernel setup | The kernel real-mode code. -090200 +------------------------+ - | Kernel boot sector | The kernel legacy boot sector. -090000 +------------------------+ - | Protected-mode kernel | The bulk of the kernel image. -010000 +------------------------+ - | Boot loader | <- Boot sector entry point 0000:7C00 -001000 +------------------------+ - | Reserved for MBR/BIOS | -000800 +------------------------+ - | Typically used by MBR | -000600 +------------------------+ - | BIOS use only | -000000 +------------------------+ - - -When using bzImage, the protected-mode kernel was relocated to -0x100000 ("high memory"), and the kernel real-mode block (boot sector, -setup, and stack/heap) was made relocatable to any address between -0x10000 and end of low memory. Unfortunately, in protocols 2.00 and -2.01 the 0x90000+ memory range is still used internally by the kernel; -the 2.02 protocol resolves that problem. - -It is desirable to keep the "memory ceiling" -- the highest point in -low memory touched by the boot loader -- as low as possible, since -some newer BIOSes have begun to allocate some rather large amounts of -memory, called the Extended BIOS Data Area, near the top of low -memory. The boot loader should use the "INT 12h" BIOS call to verify -how much low memory is available. - -Unfortunately, if INT 12h reports that the amount of memory is too -low, there is usually nothing the boot loader can do but to report an -error to the user. The boot loader should therefore be designed to -take up as little space in low memory as it reasonably can. For -zImage or old bzImage kernels, which need data written into the -0x90000 segment, the boot loader should make sure not to use memory -above the 0x9A000 point; too many BIOSes will break above that point. - -For a modern bzImage kernel with boot protocol version >= 2.02, a -memory layout like the following is suggested: - - ~ ~ - | Protected-mode kernel | -100000 +------------------------+ - | I/O memory hole | -0A0000 +------------------------+ - | Reserved for BIOS | Leave as much as possible unused - ~ ~ - | Command line | (Can also be below the X+10000 mark) -X+10000 +------------------------+ - | Stack/heap | For use by the kernel real-mode code. -X+08000 +------------------------+ - | Kernel setup | The kernel real-mode code. - | Kernel boot sector | The kernel legacy boot sector. -X +------------------------+ - | Boot loader | <- Boot sector entry point 0000:7C00 -001000 +------------------------+ - | Reserved for MBR/BIOS | -000800 +------------------------+ - | Typically used by MBR | -000600 +------------------------+ - | BIOS use only | -000000 +------------------------+ - -... where the address X is as low as the design of the boot loader -permits. - - -**** THE REAL-MODE KERNEL HEADER - -In the following text, and anywhere in the kernel boot sequence, "a -sector" refers to 512 bytes. It is independent of the actual sector -size of the underlying medium. - -The first step in loading a Linux kernel should be to load the -real-mode code (boot sector and setup code) and then examine the -following header at offset 0x01f1. The real-mode code can total up to -32K, although the boot loader may choose to load only the first two -sectors (1K) and then examine the bootup sector size. - -The header looks like: - -Offset Proto Name Meaning -/Size - -01F1/1 ALL(1 setup_sects The size of the setup in sectors -01F2/2 ALL root_flags If set, the root is mounted readonly -01F4/4 2.04+(2 syssize The size of the 32-bit code in 16-byte paras -01F8/2 ALL ram_size DO NOT USE - for bootsect.S use only -01FA/2 ALL vid_mode Video mode control -01FC/2 ALL root_dev Default root device number -01FE/2 ALL boot_flag 0xAA55 magic number -0200/2 2.00+ jump Jump instruction -0202/4 2.00+ header Magic signature "HdrS" -0206/2 2.00+ version Boot protocol version supported -0208/4 2.00+ realmode_swtch Boot loader hook (see below) -020C/2 2.00+ start_sys_seg The load-low segment (0x1000) (obsolete) -020E/2 2.00+ kernel_version Pointer to kernel version string -0210/1 2.00+ type_of_loader Boot loader identifier -0211/1 2.00+ loadflags Boot protocol option flags -0212/2 2.00+ setup_move_size Move to high memory size (used with hooks) -0214/4 2.00+ code32_start Boot loader hook (see below) -0218/4 2.00+ ramdisk_image initrd load address (set by boot loader) -021C/4 2.00+ ramdisk_size initrd size (set by boot loader) -0220/4 2.00+ bootsect_kludge DO NOT USE - for bootsect.S use only -0224/2 2.01+ heap_end_ptr Free memory after setup end -0226/1 2.02+(3 ext_loader_ver Extended boot loader version -0227/1 2.02+(3 ext_loader_type Extended boot loader ID -0228/4 2.02+ cmd_line_ptr 32-bit pointer to the kernel command line -022C/4 2.03+ initrd_addr_max Highest legal initrd address -0230/4 2.05+ kernel_alignment Physical addr alignment required for kernel -0234/1 2.05+ relocatable_kernel Whether kernel is relocatable or not -0235/1 2.10+ min_alignment Minimum alignment, as a power of two -0236/2 2.12+ xloadflags Boot protocol option flags -0238/4 2.06+ cmdline_size Maximum size of the kernel command line -023C/4 2.07+ hardware_subarch Hardware subarchitecture -0240/8 2.07+ hardware_subarch_data Subarchitecture-specific data -0248/4 2.08+ payload_offset Offset of kernel payload -024C/4 2.08+ payload_length Length of kernel payload -0250/8 2.09+ setup_data 64-bit physical pointer to linked list - of struct setup_data -0258/8 2.10+ pref_address Preferred loading address -0260/4 2.10+ init_size Linear memory required during initialization -0264/4 2.11+ handover_offset Offset of handover entry point - -(1) For backwards compatibility, if the setup_sects field contains 0, the - real value is 4. - -(2) For boot protocol prior to 2.04, the upper two bytes of the syssize - field are unusable, which means the size of a bzImage kernel - cannot be determined. - -(3) Ignored, but safe to set, for boot protocols 2.02-2.09. - -If the "HdrS" (0x53726448) magic number is not found at offset 0x202, -the boot protocol version is "old". Loading an old kernel, the -following parameters should be assumed: - - Image type = zImage - initrd not supported - Real-mode kernel must be located at 0x90000. - -Otherwise, the "version" field contains the protocol version, -e.g. protocol version 2.01 will contain 0x0201 in this field. When -setting fields in the header, you must make sure only to set fields -supported by the protocol version in use. - - -**** DETAILS OF HEADER FIELDS - -For each field, some are information from the kernel to the bootloader -("read"), some are expected to be filled out by the bootloader -("write"), and some are expected to be read and modified by the -bootloader ("modify"). - -All general purpose boot loaders should write the fields marked -(obligatory). Boot loaders who want to load the kernel at a -nonstandard address should fill in the fields marked (reloc); other -boot loaders can ignore those fields. - -The byte order of all fields is littleendian (this is x86, after all.) - -Field name: setup_sects -Type: read -Offset/size: 0x1f1/1 -Protocol: ALL - - The size of the setup code in 512-byte sectors. If this field is - 0, the real value is 4. The real-mode code consists of the boot - sector (always one 512-byte sector) plus the setup code. - -Field name: root_flags -Type: modify (optional) -Offset/size: 0x1f2/2 -Protocol: ALL - - If this field is nonzero, the root defaults to readonly. The use of - this field is deprecated; use the "ro" or "rw" options on the - command line instead. - -Field name: syssize -Type: read -Offset/size: 0x1f4/4 (protocol 2.04+) 0x1f4/2 (protocol ALL) -Protocol: 2.04+ - - The size of the protected-mode code in units of 16-byte paragraphs. - For protocol versions older than 2.04 this field is only two bytes - wide, and therefore cannot be trusted for the size of a kernel if - the LOAD_HIGH flag is set. - -Field name: ram_size -Type: kernel internal -Offset/size: 0x1f8/2 -Protocol: ALL - - This field is obsolete. - -Field name: vid_mode -Type: modify (obligatory) -Offset/size: 0x1fa/2 - - Please see the section on SPECIAL COMMAND LINE OPTIONS. - -Field name: root_dev -Type: modify (optional) -Offset/size: 0x1fc/2 -Protocol: ALL - - The default root device device number. The use of this field is - deprecated, use the "root=" option on the command line instead. - -Field name: boot_flag -Type: read -Offset/size: 0x1fe/2 -Protocol: ALL - - Contains 0xAA55. This is the closest thing old Linux kernels have - to a magic number. - -Field name: jump -Type: read -Offset/size: 0x200/2 -Protocol: 2.00+ - - Contains an x86 jump instruction, 0xEB followed by a signed offset - relative to byte 0x202. This can be used to determine the size of - the header. - -Field name: header -Type: read -Offset/size: 0x202/4 -Protocol: 2.00+ - - Contains the magic number "HdrS" (0x53726448). - -Field name: version -Type: read -Offset/size: 0x206/2 -Protocol: 2.00+ - - Contains the boot protocol version, in (major << 8)+minor format, - e.g. 0x0204 for version 2.04, and 0x0a11 for a hypothetical version - 10.17. - -Field name: realmode_swtch -Type: modify (optional) -Offset/size: 0x208/4 -Protocol: 2.00+ - - Boot loader hook (see ADVANCED BOOT LOADER HOOKS below.) - -Field name: start_sys_seg -Type: read -Offset/size: 0x20c/2 -Protocol: 2.00+ - - The load low segment (0x1000). Obsolete. - -Field name: kernel_version -Type: read -Offset/size: 0x20e/2 -Protocol: 2.00+ - - If set to a nonzero value, contains a pointer to a NUL-terminated - human-readable kernel version number string, less 0x200. This can - be used to display the kernel version to the user. This value - should be less than (0x200*setup_sects). - - For example, if this value is set to 0x1c00, the kernel version - number string can be found at offset 0x1e00 in the kernel file. - This is a valid value if and only if the "setup_sects" field - contains the value 15 or higher, as: - - 0x1c00 < 15*0x200 (= 0x1e00) but - 0x1c00 >= 14*0x200 (= 0x1c00) - - 0x1c00 >> 9 = 14, so the minimum value for setup_secs is 15. - -Field name: type_of_loader -Type: write (obligatory) -Offset/size: 0x210/1 -Protocol: 2.00+ - - If your boot loader has an assigned id (see table below), enter - 0xTV here, where T is an identifier for the boot loader and V is - a version number. Otherwise, enter 0xFF here. - - For boot loader IDs above T = 0xD, write T = 0xE to this field and - write the extended ID minus 0x10 to the ext_loader_type field. - Similarly, the ext_loader_ver field can be used to provide more than - four bits for the bootloader version. - - For example, for T = 0x15, V = 0x234, write: - - type_of_loader <- 0xE4 - ext_loader_type <- 0x05 - ext_loader_ver <- 0x23 - - Assigned boot loader ids (hexadecimal): - - 0 LILO (0x00 reserved for pre-2.00 bootloader) - 1 Loadlin - 2 bootsect-loader (0x20, all other values reserved) - 3 Syslinux - 4 Etherboot/gPXE/iPXE - 5 ELILO - 7 GRUB - 8 U-Boot - 9 Xen - A Gujin - B Qemu - C Arcturus Networks uCbootloader - D kexec-tools - E Extended (see ext_loader_type) - F Special (0xFF = undefined) - 10 Reserved - 11 Minimal Linux Bootloader - 12 OVMF UEFI virtualization stack - - Please contact if you need a bootloader ID - value assigned. - -Field name: loadflags -Type: modify (obligatory) -Offset/size: 0x211/1 -Protocol: 2.00+ - - This field is a bitmask. - - Bit 0 (read): LOADED_HIGH - - If 0, the protected-mode code is loaded at 0x10000. - - If 1, the protected-mode code is loaded at 0x100000. - - Bit 1 (kernel internal): KASLR_FLAG - - Used internally by the compressed kernel to communicate - KASLR status to kernel proper. - If 1, KASLR enabled. - If 0, KASLR disabled. - - Bit 5 (write): QUIET_FLAG - - If 0, print early messages. - - If 1, suppress early messages. - This requests to the kernel (decompressor and early - kernel) to not write early messages that require - accessing the display hardware directly. - - Bit 6 (write): KEEP_SEGMENTS - Protocol: 2.07+ - - If 0, reload the segment registers in the 32bit entry point. - - If 1, do not reload the segment registers in the 32bit entry point. - Assume that %cs %ds %ss %es are all set to flat segments with - a base of 0 (or the equivalent for their environment). - - Bit 7 (write): CAN_USE_HEAP - Set this bit to 1 to indicate that the value entered in the - heap_end_ptr is valid. If this field is clear, some setup code - functionality will be disabled. - -Field name: setup_move_size -Type: modify (obligatory) -Offset/size: 0x212/2 -Protocol: 2.00-2.01 - - When using protocol 2.00 or 2.01, if the real mode kernel is not - loaded at 0x90000, it gets moved there later in the loading - sequence. Fill in this field if you want additional data (such as - the kernel command line) moved in addition to the real-mode kernel - itself. - - The unit is bytes starting with the beginning of the boot sector. - - This field is can be ignored when the protocol is 2.02 or higher, or - if the real-mode code is loaded at 0x90000. - -Field name: code32_start -Type: modify (optional, reloc) -Offset/size: 0x214/4 -Protocol: 2.00+ - - The address to jump to in protected mode. This defaults to the load - address of the kernel, and can be used by the boot loader to - determine the proper load address. - - This field can be modified for two purposes: - - 1. as a boot loader hook (see ADVANCED BOOT LOADER HOOKS below.) - - 2. if a bootloader which does not install a hook loads a - relocatable kernel at a nonstandard address it will have to modify - this field to point to the load address. - -Field name: ramdisk_image -Type: write (obligatory) -Offset/size: 0x218/4 -Protocol: 2.00+ - - The 32-bit linear address of the initial ramdisk or ramfs. Leave at - zero if there is no initial ramdisk/ramfs. - -Field name: ramdisk_size -Type: write (obligatory) -Offset/size: 0x21c/4 -Protocol: 2.00+ - - Size of the initial ramdisk or ramfs. Leave at zero if there is no - initial ramdisk/ramfs. - -Field name: bootsect_kludge -Type: kernel internal -Offset/size: 0x220/4 -Protocol: 2.00+ - - This field is obsolete. - -Field name: heap_end_ptr -Type: write (obligatory) -Offset/size: 0x224/2 -Protocol: 2.01+ - - Set this field to the offset (from the beginning of the real-mode - code) of the end of the setup stack/heap, minus 0x0200. - -Field name: ext_loader_ver -Type: write (optional) -Offset/size: 0x226/1 -Protocol: 2.02+ - - This field is used as an extension of the version number in the - type_of_loader field. The total version number is considered to be - (type_of_loader & 0x0f) + (ext_loader_ver << 4). - - The use of this field is boot loader specific. If not written, it - is zero. - - Kernels prior to 2.6.31 did not recognize this field, but it is safe - to write for protocol version 2.02 or higher. - -Field name: ext_loader_type -Type: write (obligatory if (type_of_loader & 0xf0) == 0xe0) -Offset/size: 0x227/1 -Protocol: 2.02+ - - This field is used as an extension of the type number in - type_of_loader field. If the type in type_of_loader is 0xE, then - the actual type is (ext_loader_type + 0x10). - - This field is ignored if the type in type_of_loader is not 0xE. - - Kernels prior to 2.6.31 did not recognize this field, but it is safe - to write for protocol version 2.02 or higher. - -Field name: cmd_line_ptr -Type: write (obligatory) -Offset/size: 0x228/4 -Protocol: 2.02+ - - Set this field to the linear address of the kernel command line. - The kernel command line can be located anywhere between the end of - the setup heap and 0xA0000; it does not have to be located in the - same 64K segment as the real-mode code itself. - - Fill in this field even if your boot loader does not support a - command line, in which case you can point this to an empty string - (or better yet, to the string "auto".) If this field is left at - zero, the kernel will assume that your boot loader does not support - the 2.02+ protocol. - -Field name: initrd_addr_max -Type: read -Offset/size: 0x22c/4 -Protocol: 2.03+ - - The maximum address that may be occupied by the initial - ramdisk/ramfs contents. For boot protocols 2.02 or earlier, this - field is not present, and the maximum address is 0x37FFFFFF. (This - address is defined as the address of the highest safe byte, so if - your ramdisk is exactly 131072 bytes long and this field is - 0x37FFFFFF, you can start your ramdisk at 0x37FE0000.) - -Field name: kernel_alignment -Type: read/modify (reloc) -Offset/size: 0x230/4 -Protocol: 2.05+ (read), 2.10+ (modify) - - Alignment unit required by the kernel (if relocatable_kernel is - true.) A relocatable kernel that is loaded at an alignment - incompatible with the value in this field will be realigned during - kernel initialization. - - Starting with protocol version 2.10, this reflects the kernel - alignment preferred for optimal performance; it is possible for the - loader to modify this field to permit a lesser alignment. See the - min_alignment and pref_address field below. - -Field name: relocatable_kernel -Type: read (reloc) -Offset/size: 0x234/1 -Protocol: 2.05+ - - If this field is nonzero, the protected-mode part of the kernel can - be loaded at any address that satisfies the kernel_alignment field. - After loading, the boot loader must set the code32_start field to - point to the loaded code, or to a boot loader hook. - -Field name: min_alignment -Type: read (reloc) -Offset/size: 0x235/1 -Protocol: 2.10+ - - This field, if nonzero, indicates as a power of two the minimum - alignment required, as opposed to preferred, by the kernel to boot. - If a boot loader makes use of this field, it should update the - kernel_alignment field with the alignment unit desired; typically: - - kernel_alignment = 1 << min_alignment - - There may be a considerable performance cost with an excessively - misaligned kernel. Therefore, a loader should typically try each - power-of-two alignment from kernel_alignment down to this alignment. - -Field name: xloadflags -Type: read -Offset/size: 0x236/2 -Protocol: 2.12+ - - This field is a bitmask. - - Bit 0 (read): XLF_KERNEL_64 - - If 1, this kernel has the legacy 64-bit entry point at 0x200. - - Bit 1 (read): XLF_CAN_BE_LOADED_ABOVE_4G - - If 1, kernel/boot_params/cmdline/ramdisk can be above 4G. - - Bit 2 (read): XLF_EFI_HANDOVER_32 - - If 1, the kernel supports the 32-bit EFI handoff entry point - given at handover_offset. - - Bit 3 (read): XLF_EFI_HANDOVER_64 - - If 1, the kernel supports the 64-bit EFI handoff entry point - given at handover_offset + 0x200. - - Bit 4 (read): XLF_EFI_KEXEC - - If 1, the kernel supports kexec EFI boot with EFI runtime support. - -Field name: cmdline_size -Type: read -Offset/size: 0x238/4 -Protocol: 2.06+ - - The maximum size of the command line without the terminating - zero. This means that the command line can contain at most - cmdline_size characters. With protocol version 2.05 and earlier, the - maximum size was 255. - -Field name: hardware_subarch -Type: write (optional, defaults to x86/PC) -Offset/size: 0x23c/4 -Protocol: 2.07+ - - In a paravirtualized environment the hardware low level architectural - pieces such as interrupt handling, page table handling, and - accessing process control registers needs to be done differently. - - This field allows the bootloader to inform the kernel we are in one - one of those environments. - - 0x00000000 The default x86/PC environment - 0x00000001 lguest - 0x00000002 Xen - 0x00000003 Moorestown MID - 0x00000004 CE4100 TV Platform - -Field name: hardware_subarch_data -Type: write (subarch-dependent) -Offset/size: 0x240/8 -Protocol: 2.07+ - - A pointer to data that is specific to hardware subarch - This field is currently unused for the default x86/PC environment, - do not modify. - -Field name: payload_offset -Type: read -Offset/size: 0x248/4 -Protocol: 2.08+ - - If non-zero then this field contains the offset from the beginning - of the protected-mode code to the payload. - - The payload may be compressed. The format of both the compressed and - uncompressed data should be determined using the standard magic - numbers. The currently supported compression formats are gzip - (magic numbers 1F 8B or 1F 9E), bzip2 (magic number 42 5A), LZMA - (magic number 5D 00), XZ (magic number FD 37), and LZ4 (magic number - 02 21). The uncompressed payload is currently always ELF (magic - number 7F 45 4C 46). - -Field name: payload_length -Type: read -Offset/size: 0x24c/4 -Protocol: 2.08+ - - The length of the payload. - -Field name: setup_data -Type: write (special) -Offset/size: 0x250/8 -Protocol: 2.09+ - - The 64-bit physical pointer to NULL terminated single linked list of - struct setup_data. This is used to define a more extensible boot - parameters passing mechanism. The definition of struct setup_data is - as follow: - - struct setup_data { - u64 next; - u32 type; - u32 len; - u8 data[0]; - }; - - Where, the next is a 64-bit physical pointer to the next node of - linked list, the next field of the last node is 0; the type is used - to identify the contents of data; the len is the length of data - field; the data holds the real payload. - - This list may be modified at a number of points during the bootup - process. Therefore, when modifying this list one should always make - sure to consider the case where the linked list already contains - entries. - -Field name: pref_address -Type: read (reloc) -Offset/size: 0x258/8 -Protocol: 2.10+ - - This field, if nonzero, represents a preferred load address for the - kernel. A relocating bootloader should attempt to load at this - address if possible. - - A non-relocatable kernel will unconditionally move itself and to run - at this address. - -Field name: init_size -Type: read -Offset/size: 0x260/4 - - This field indicates the amount of linear contiguous memory starting - at the kernel runtime start address that the kernel needs before it - is capable of examining its memory map. This is not the same thing - as the total amount of memory the kernel needs to boot, but it can - be used by a relocating boot loader to help select a safe load - address for the kernel. - - The kernel runtime start address is determined by the following algorithm: - - if (relocatable_kernel) - runtime_start = align_up(load_address, kernel_alignment) - else - runtime_start = pref_address - -Field name: handover_offset -Type: read -Offset/size: 0x264/4 - - This field is the offset from the beginning of the kernel image to - the EFI handover protocol entry point. Boot loaders using the EFI - handover protocol to boot the kernel should jump to this offset. - - See EFI HANDOVER PROTOCOL below for more details. - - -**** THE IMAGE CHECKSUM - -From boot protocol version 2.08 onwards the CRC-32 is calculated over -the entire file using the characteristic polynomial 0x04C11DB7 and an -initial remainder of 0xffffffff. The checksum is appended to the -file; therefore the CRC of the file up to the limit specified in the -syssize field of the header is always 0. - - -**** THE KERNEL COMMAND LINE - -The kernel command line has become an important way for the boot -loader to communicate with the kernel. Some of its options are also -relevant to the boot loader itself, see "special command line options" -below. - -The kernel command line is a null-terminated string. The maximum -length can be retrieved from the field cmdline_size. Before protocol -version 2.06, the maximum was 255 characters. A string that is too -long will be automatically truncated by the kernel. - -If the boot protocol version is 2.02 or later, the address of the -kernel command line is given by the header field cmd_line_ptr (see -above.) This address can be anywhere between the end of the setup -heap and 0xA0000. - -If the protocol version is *not* 2.02 or higher, the kernel -command line is entered using the following protocol: - - At offset 0x0020 (word), "cmd_line_magic", enter the magic - number 0xA33F. - - At offset 0x0022 (word), "cmd_line_offset", enter the offset - of the kernel command line (relative to the start of the - real-mode kernel). - - The kernel command line *must* be within the memory region - covered by setup_move_size, so you may need to adjust this - field. - - -**** MEMORY LAYOUT OF THE REAL-MODE CODE - -The real-mode code requires a stack/heap to be set up, as well as -memory allocated for the kernel command line. This needs to be done -in the real-mode accessible memory in bottom megabyte. - -It should be noted that modern machines often have a sizable Extended -BIOS Data Area (EBDA). As a result, it is advisable to use as little -of the low megabyte as possible. - -Unfortunately, under the following circumstances the 0x90000 memory -segment has to be used: - - - When loading a zImage kernel ((loadflags & 0x01) == 0). - - When loading a 2.01 or earlier boot protocol kernel. - - -> For the 2.00 and 2.01 boot protocols, the real-mode code - can be loaded at another address, but it is internally - relocated to 0x90000. For the "old" protocol, the - real-mode code must be loaded at 0x90000. - -When loading at 0x90000, avoid using memory above 0x9a000. - -For boot protocol 2.02 or higher, the command line does not have to be -located in the same 64K segment as the real-mode setup code; it is -thus permitted to give the stack/heap the full 64K segment and locate -the command line above it. - -The kernel command line should not be located below the real-mode -code, nor should it be located in high memory. - - -**** SAMPLE BOOT CONFIGURATION - -As a sample configuration, assume the following layout of the real -mode segment: - - When loading below 0x90000, use the entire segment: - - 0x0000-0x7fff Real mode kernel - 0x8000-0xdfff Stack and heap - 0xe000-0xffff Kernel command line - - When loading at 0x90000 OR the protocol version is 2.01 or earlier: - - 0x0000-0x7fff Real mode kernel - 0x8000-0x97ff Stack and heap - 0x9800-0x9fff Kernel command line - -Such a boot loader should enter the following fields in the header: - - unsigned long base_ptr; /* base address for real-mode segment */ - - if ( setup_sects == 0 ) { - setup_sects = 4; - } - - if ( protocol >= 0x0200 ) { - type_of_loader = ; - if ( loading_initrd ) { - ramdisk_image = ; - ramdisk_size = ; - } - - if ( protocol >= 0x0202 && loadflags & 0x01 ) - heap_end = 0xe000; - else - heap_end = 0x9800; - - if ( protocol >= 0x0201 ) { - heap_end_ptr = heap_end - 0x200; - loadflags |= 0x80; /* CAN_USE_HEAP */ - } - - if ( protocol >= 0x0202 ) { - cmd_line_ptr = base_ptr + heap_end; - strcpy(cmd_line_ptr, cmdline); - } else { - cmd_line_magic = 0xA33F; - cmd_line_offset = heap_end; - setup_move_size = heap_end + strlen(cmdline)+1; - strcpy(base_ptr+cmd_line_offset, cmdline); - } - } else { - /* Very old kernel */ - - heap_end = 0x9800; - - cmd_line_magic = 0xA33F; - cmd_line_offset = heap_end; - - /* A very old kernel MUST have its real-mode code - loaded at 0x90000 */ - - if ( base_ptr != 0x90000 ) { - /* Copy the real-mode kernel */ - memcpy(0x90000, base_ptr, (setup_sects+1)*512); - base_ptr = 0x90000; /* Relocated */ - } - - strcpy(0x90000+cmd_line_offset, cmdline); - - /* It is recommended to clear memory up to the 32K mark */ - memset(0x90000 + (setup_sects+1)*512, 0, - (64-(setup_sects+1))*512); - } - - -**** LOADING THE REST OF THE KERNEL - -The 32-bit (non-real-mode) kernel starts at offset (setup_sects+1)*512 -in the kernel file (again, if setup_sects == 0 the real value is 4.) -It should be loaded at address 0x10000 for Image/zImage kernels and -0x100000 for bzImage kernels. - -The kernel is a bzImage kernel if the protocol >= 2.00 and the 0x01 -bit (LOAD_HIGH) in the loadflags field is set: - - is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01); - load_address = is_bzImage ? 0x100000 : 0x10000; - -Note that Image/zImage kernels can be up to 512K in size, and thus use -the entire 0x10000-0x90000 range of memory. This means it is pretty -much a requirement for these kernels to load the real-mode part at -0x90000. bzImage kernels allow much more flexibility. - - -**** SPECIAL COMMAND LINE OPTIONS - -If the command line provided by the boot loader is entered by the -user, the user may expect the following command line options to work. -They should normally not be deleted from the kernel command line even -though not all of them are actually meaningful to the kernel. Boot -loader authors who need additional command line options for the boot -loader itself should get them registered in -Documentation/admin-guide/kernel-parameters.rst to make sure they will not -conflict with actual kernel options now or in the future. - - vga= - here is either an integer (in C notation, either - decimal, octal, or hexadecimal) or one of the strings - "normal" (meaning 0xFFFF), "ext" (meaning 0xFFFE) or "ask" - (meaning 0xFFFD). This value should be entered into the - vid_mode field, as it is used by the kernel before the command - line is parsed. - - mem= - is an integer in C notation optionally followed by - (case insensitive) K, M, G, T, P or E (meaning << 10, << 20, - << 30, << 40, << 50 or << 60). This specifies the end of - memory to the kernel. This affects the possible placement of - an initrd, since an initrd should be placed near end of - memory. Note that this is an option to *both* the kernel and - the bootloader! - - initrd= - An initrd should be loaded. The meaning of is - obviously bootloader-dependent, and some boot loaders - (e.g. LILO) do not have such a command. - -In addition, some boot loaders add the following options to the -user-specified command line: - - BOOT_IMAGE= - The boot image which was loaded. Again, the meaning of - is obviously bootloader-dependent. - - auto - The kernel was booted without explicit user intervention. - -If these options are added by the boot loader, it is highly -recommended that they are located *first*, before the user-specified -or configuration-specified command line. Otherwise, "init=/bin/sh" -gets confused by the "auto" option. - - -**** RUNNING THE KERNEL - -The kernel is started by jumping to the kernel entry point, which is -located at *segment* offset 0x20 from the start of the real mode -kernel. This means that if you loaded your real-mode kernel code at -0x90000, the kernel entry point is 9020:0000. - -At entry, ds = es = ss should point to the start of the real-mode -kernel code (0x9000 if the code is loaded at 0x90000), sp should be -set up properly, normally pointing to the top of the heap, and -interrupts should be disabled. Furthermore, to guard against bugs in -the kernel, it is recommended that the boot loader sets fs = gs = ds = -es = ss. - -In our example from above, we would do: - - /* Note: in the case of the "old" kernel protocol, base_ptr must - be == 0x90000 at this point; see the previous sample code */ - - seg = base_ptr >> 4; - - cli(); /* Enter with interrupts disabled! */ - - /* Set up the real-mode kernel stack */ - _SS = seg; - _SP = heap_end; - - _DS = _ES = _FS = _GS = seg; - jmp_far(seg+0x20, 0); /* Run the kernel */ - -If your boot sector accesses a floppy drive, it is recommended to -switch off the floppy motor before running the kernel, since the -kernel boot leaves interrupts off and thus the motor will not be -switched off, especially if the loaded kernel has the floppy driver as -a demand-loaded module! - - -**** ADVANCED BOOT LOADER HOOKS - -If the boot loader runs in a particularly hostile environment (such as -LOADLIN, which runs under DOS) it may be impossible to follow the -standard memory location requirements. Such a boot loader may use the -following hooks that, if set, are invoked by the kernel at the -appropriate time. The use of these hooks should probably be -considered an absolutely last resort! - -IMPORTANT: All the hooks are required to preserve %esp, %ebp, %esi and -%edi across invocation. - - realmode_swtch: - A 16-bit real mode far subroutine invoked immediately before - entering protected mode. The default routine disables NMI, so - your routine should probably do so, too. - - code32_start: - A 32-bit flat-mode routine *jumped* to immediately after the - transition to protected mode, but before the kernel is - uncompressed. No segments, except CS, are guaranteed to be - set up (current kernels do, but older ones do not); you should - set them up to BOOT_DS (0x18) yourself. - - After completing your hook, you should jump to the address - that was in this field before your boot loader overwrote it - (relocated, if appropriate.) - - -**** 32-bit BOOT PROTOCOL - -For machine with some new BIOS other than legacy BIOS, such as EFI, -LinuxBIOS, etc, and kexec, the 16-bit real mode setup code in kernel -based on legacy BIOS can not be used, so a 32-bit boot protocol needs -to be defined. - -In 32-bit boot protocol, the first step in loading a Linux kernel -should be to setup the boot parameters (struct boot_params, -traditionally known as "zero page"). The memory for struct boot_params -should be allocated and initialized to all zero. Then the setup header -from offset 0x01f1 of kernel image on should be loaded into struct -boot_params and examined. The end of setup header can be calculated as -follow: - - 0x0202 + byte value at offset 0x0201 - -In addition to read/modify/write the setup header of the struct -boot_params as that of 16-bit boot protocol, the boot loader should -also fill the additional fields of the struct boot_params as that -described in zero-page.txt. - -After setting up the struct boot_params, the boot loader can load the -32/64-bit kernel in the same way as that of 16-bit boot protocol. - -In 32-bit boot protocol, the kernel is started by jumping to the -32-bit kernel entry point, which is the start address of loaded -32/64-bit kernel. - -At entry, the CPU must be in 32-bit protected mode with paging -disabled; a GDT must be loaded with the descriptors for selectors -__BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat -segment; __BOOT_CS must have execute/read permission, and __BOOT_DS -must have read/write permission; CS must be __BOOT_CS and DS, ES, SS -must be __BOOT_DS; interrupt must be disabled; %esi must hold the base -address of the struct boot_params; %ebp, %edi and %ebx must be zero. - -**** 64-bit BOOT PROTOCOL - -For machine with 64bit cpus and 64bit kernel, we could use 64bit bootloader -and we need a 64-bit boot protocol. - -In 64-bit boot protocol, the first step in loading a Linux kernel -should be to setup the boot parameters (struct boot_params, -traditionally known as "zero page"). The memory for struct boot_params -could be allocated anywhere (even above 4G) and initialized to all zero. -Then, the setup header at offset 0x01f1 of kernel image on should be -loaded into struct boot_params and examined. The end of setup header -can be calculated as follows: - - 0x0202 + byte value at offset 0x0201 - -In addition to read/modify/write the setup header of the struct -boot_params as that of 16-bit boot protocol, the boot loader should -also fill the additional fields of the struct boot_params as described -in zero-page.txt. - -After setting up the struct boot_params, the boot loader can load -64-bit kernel in the same way as that of 16-bit boot protocol, but -kernel could be loaded above 4G. - -In 64-bit boot protocol, the kernel is started by jumping to the -64-bit kernel entry point, which is the start address of loaded -64-bit kernel plus 0x200. - -At entry, the CPU must be in 64-bit mode with paging enabled. -The range with setup_header.init_size from start address of loaded -kernel and zero page and command line buffer get ident mapping; -a GDT must be loaded with the descriptors for selectors -__BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat -segment; __BOOT_CS must have execute/read permission, and __BOOT_DS -must have read/write permission; CS must be __BOOT_CS and DS, ES, SS -must be __BOOT_DS; interrupt must be disabled; %rsi must hold the base -address of the struct boot_params. - -**** EFI HANDOVER PROTOCOL - -This protocol allows boot loaders to defer initialisation to the EFI -boot stub. The boot loader is required to load the kernel/initrd(s) -from the boot media and jump to the EFI handover protocol entry point -which is hdr->handover_offset bytes from the beginning of -startup_{32,64}. - -The function prototype for the handover entry point looks like this, - - efi_main(void *handle, efi_system_table_t *table, struct boot_params *bp) - -'handle' is the EFI image handle passed to the boot loader by the EFI -firmware, 'table' is the EFI system table - these are the first two -arguments of the "handoff state" as described in section 2.3 of the -UEFI specification. 'bp' is the boot loader-allocated boot params. - -The boot loader *must* fill out the following fields in bp, - - o hdr.code32_start - o hdr.cmd_line_ptr - o hdr.ramdisk_image (if applicable) - o hdr.ramdisk_size (if applicable) - -All other fields should be zero. diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index 9f34545a9c52..d7fc8efac192 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -7,3 +7,5 @@ x86-specific Documentation .. toctree:: :maxdepth: 2 :numbered: + + boot -- cgit v1.2.3 From 848942cb2ef584752d7c41594b2cc91229fe7f01 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:17 +0800 Subject: Documentation: x86: convert topology.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/index.rst | 1 + Documentation/x86/topology.rst | 221 +++++++++++++++++++++++++++++++++++++++++ Documentation/x86/topology.txt | 217 ---------------------------------------- 3 files changed, 222 insertions(+), 217 deletions(-) create mode 100644 Documentation/x86/topology.rst delete mode 100644 Documentation/x86/topology.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index d7fc8efac192..da89bf0ad69f 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -9,3 +9,4 @@ x86-specific Documentation :numbered: boot + topology diff --git a/Documentation/x86/topology.rst b/Documentation/x86/topology.rst new file mode 100644 index 000000000000..5176e5315faa --- /dev/null +++ b/Documentation/x86/topology.rst @@ -0,0 +1,221 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============ +x86 Topology +============ + +This documents and clarifies the main aspects of x86 topology modelling and +representation in the kernel. Update/change when doing changes to the +respective code. + +The architecture-agnostic topology definitions are in +Documentation/cputopology.txt. This file holds x86-specific +differences/specialities which must not necessarily apply to the generic +definitions. Thus, the way to read up on Linux topology on x86 is to start +with the generic one and look at this one in parallel for the x86 specifics. + +Needless to say, code should use the generic functions - this file is *only* +here to *document* the inner workings of x86 topology. + +Started by Thomas Gleixner and Borislav Petkov . + +The main aim of the topology facilities is to present adequate interfaces to +code which needs to know/query/use the structure of the running system wrt +threads, cores, packages, etc. + +The kernel does not care about the concept of physical sockets because a +socket has no relevance to software. It's an electromechanical component. In +the past a socket always contained a single package (see below), but with the +advent of Multi Chip Modules (MCM) a socket can hold more than one package. So +there might be still references to sockets in the code, but they are of +historical nature and should be cleaned up. + +The topology of a system is described in the units of: + + - packages + - cores + - threads + +Package +======= +Packages contain a number of cores plus shared resources, e.g. DRAM +controller, shared caches etc. + +AMD nomenclature for package is 'Node'. + +Package-related topology information in the kernel: + + - cpuinfo_x86.x86_max_cores: + + The number of cores in a package. This information is retrieved via CPUID. + + - cpuinfo_x86.phys_proc_id: + + The physical ID of the package. This information is retrieved via CPUID + and deduced from the APIC IDs of the cores in the package. + + - cpuinfo_x86.logical_id: + + The logical ID of the package. As we do not trust BIOSes to enumerate the + packages in a consistent way, we introduced the concept of logical package + ID so we can sanely calculate the number of maximum possible packages in + the system and have the packages enumerated linearly. + + - topology_max_packages(): + + The maximum possible number of packages in the system. Helpful for per + package facilities to preallocate per package information. + + - cpu_llc_id: + + A per-CPU variable containing: + + - On Intel, the first APIC ID of the list of CPUs sharing the Last Level + Cache + + - On AMD, the Node ID or Core Complex ID containing the Last Level + Cache. In general, it is a number identifying an LLC uniquely on the + system. + +Cores +===== +A core consists of 1 or more threads. It does not matter whether the threads +are SMT- or CMT-type threads. + +AMDs nomenclature for a CMT core is "Compute Unit". The kernel always uses +"core". + +Core-related topology information in the kernel: + + - smp_num_siblings: + + The number of threads in a core. The number of threads in a package can be + calculated by:: + + threads_per_package = cpuinfo_x86.x86_max_cores * smp_num_siblings + + +Threads +======= +A thread is a single scheduling unit. It's the equivalent to a logical Linux +CPU. + +AMDs nomenclature for CMT threads is "Compute Unit Core". The kernel always +uses "thread". + +Thread-related topology information in the kernel: + + - topology_core_cpumask(): + + The cpumask contains all online threads in the package to which a thread + belongs. + + The number of online threads is also printed in /proc/cpuinfo "siblings." + + - topology_sibling_cpumask(): + + The cpumask contains all online threads in the core to which a thread + belongs. + + - topology_logical_package_id(): + + The logical package ID to which a thread belongs. + + - topology_physical_package_id(): + + The physical package ID to which a thread belongs. + + - topology_core_id(); + + The ID of the core to which a thread belongs. It is also printed in /proc/cpuinfo + "core_id." + + + +System topology examples +======================== + +.. note:: + The alternative Linux CPU enumeration depends on how the BIOS enumerates the + threads. Many BIOSes enumerate all threads 0 first and then all threads 1. + That has the "advantage" that the logical Linux CPU numbers of threads 0 stay + the same whether threads are enabled or not. That's merely an implementation + detail and has no practical impact. + +1) Single Package, Single Core:: + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + +2) Single Package, Dual Core + + a) One thread per core:: + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + -> [core 1] -> [thread 0] -> Linux CPU 1 + + b) Two threads per core:: + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + -> [thread 1] -> Linux CPU 1 + -> [core 1] -> [thread 0] -> Linux CPU 2 + -> [thread 1] -> Linux CPU 3 + + Alternative enumeration:: + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + -> [thread 1] -> Linux CPU 2 + -> [core 1] -> [thread 0] -> Linux CPU 1 + -> [thread 1] -> Linux CPU 3 + + AMD nomenclature for CMT systems:: + + [node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0 + -> [Compute Unit Core 1] -> Linux CPU 1 + -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 2 + -> [Compute Unit Core 1] -> Linux CPU 3 + +4) Dual Package, Dual Core + + a) One thread per core:: + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + -> [core 1] -> [thread 0] -> Linux CPU 1 + + [package 1] -> [core 0] -> [thread 0] -> Linux CPU 2 + -> [core 1] -> [thread 0] -> Linux CPU 3 + + b) Two threads per core:: + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + -> [thread 1] -> Linux CPU 1 + -> [core 1] -> [thread 0] -> Linux CPU 2 + -> [thread 1] -> Linux CPU 3 + + [package 1] -> [core 0] -> [thread 0] -> Linux CPU 4 + -> [thread 1] -> Linux CPU 5 + -> [core 1] -> [thread 0] -> Linux CPU 6 + -> [thread 1] -> Linux CPU 7 + + Alternative enumeration:: + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + -> [thread 1] -> Linux CPU 4 + -> [core 1] -> [thread 0] -> Linux CPU 1 + -> [thread 1] -> Linux CPU 5 + + [package 1] -> [core 0] -> [thread 0] -> Linux CPU 2 + -> [thread 1] -> Linux CPU 6 + -> [core 1] -> [thread 0] -> Linux CPU 3 + -> [thread 1] -> Linux CPU 7 + + AMD nomenclature for CMT systems:: + + [node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0 + -> [Compute Unit Core 1] -> Linux CPU 1 + -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 2 + -> [Compute Unit Core 1] -> Linux CPU 3 + + [node 1] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 4 + -> [Compute Unit Core 1] -> Linux CPU 5 + -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 6 + -> [Compute Unit Core 1] -> Linux CPU 7 diff --git a/Documentation/x86/topology.txt b/Documentation/x86/topology.txt deleted file mode 100644 index 2953e3ec9a02..000000000000 --- a/Documentation/x86/topology.txt +++ /dev/null @@ -1,217 +0,0 @@ -x86 Topology -============ - -This documents and clarifies the main aspects of x86 topology modelling and -representation in the kernel. Update/change when doing changes to the -respective code. - -The architecture-agnostic topology definitions are in -Documentation/cputopology.txt. This file holds x86-specific -differences/specialities which must not necessarily apply to the generic -definitions. Thus, the way to read up on Linux topology on x86 is to start -with the generic one and look at this one in parallel for the x86 specifics. - -Needless to say, code should use the generic functions - this file is *only* -here to *document* the inner workings of x86 topology. - -Started by Thomas Gleixner and Borislav Petkov . - -The main aim of the topology facilities is to present adequate interfaces to -code which needs to know/query/use the structure of the running system wrt -threads, cores, packages, etc. - -The kernel does not care about the concept of physical sockets because a -socket has no relevance to software. It's an electromechanical component. In -the past a socket always contained a single package (see below), but with the -advent of Multi Chip Modules (MCM) a socket can hold more than one package. So -there might be still references to sockets in the code, but they are of -historical nature and should be cleaned up. - -The topology of a system is described in the units of: - - - packages - - cores - - threads - -* Package: - - Packages contain a number of cores plus shared resources, e.g. DRAM - controller, shared caches etc. - - AMD nomenclature for package is 'Node'. - - Package-related topology information in the kernel: - - - cpuinfo_x86.x86_max_cores: - - The number of cores in a package. This information is retrieved via CPUID. - - - cpuinfo_x86.phys_proc_id: - - The physical ID of the package. This information is retrieved via CPUID - and deduced from the APIC IDs of the cores in the package. - - - cpuinfo_x86.logical_id: - - The logical ID of the package. As we do not trust BIOSes to enumerate the - packages in a consistent way, we introduced the concept of logical package - ID so we can sanely calculate the number of maximum possible packages in - the system and have the packages enumerated linearly. - - - topology_max_packages(): - - The maximum possible number of packages in the system. Helpful for per - package facilities to preallocate per package information. - - - cpu_llc_id: - - A per-CPU variable containing: - - On Intel, the first APIC ID of the list of CPUs sharing the Last Level - Cache - - - On AMD, the Node ID or Core Complex ID containing the Last Level - Cache. In general, it is a number identifying an LLC uniquely on the - system. - -* Cores: - - A core consists of 1 or more threads. It does not matter whether the threads - are SMT- or CMT-type threads. - - AMDs nomenclature for a CMT core is "Compute Unit". The kernel always uses - "core". - - Core-related topology information in the kernel: - - - smp_num_siblings: - - The number of threads in a core. The number of threads in a package can be - calculated by: - - threads_per_package = cpuinfo_x86.x86_max_cores * smp_num_siblings - - -* Threads: - - A thread is a single scheduling unit. It's the equivalent to a logical Linux - CPU. - - AMDs nomenclature for CMT threads is "Compute Unit Core". The kernel always - uses "thread". - - Thread-related topology information in the kernel: - - - topology_core_cpumask(): - - The cpumask contains all online threads in the package to which a thread - belongs. - - The number of online threads is also printed in /proc/cpuinfo "siblings." - - - topology_sibling_cpumask(): - - The cpumask contains all online threads in the core to which a thread - belongs. - - - topology_logical_package_id(): - - The logical package ID to which a thread belongs. - - - topology_physical_package_id(): - - The physical package ID to which a thread belongs. - - - topology_core_id(); - - The ID of the core to which a thread belongs. It is also printed in /proc/cpuinfo - "core_id." - - - -System topology examples - -Note: - -The alternative Linux CPU enumeration depends on how the BIOS enumerates the -threads. Many BIOSes enumerate all threads 0 first and then all threads 1. -That has the "advantage" that the logical Linux CPU numbers of threads 0 stay -the same whether threads are enabled or not. That's merely an implementation -detail and has no practical impact. - -1) Single Package, Single Core - - [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 - -2) Single Package, Dual Core - - a) One thread per core - - [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 - -> [core 1] -> [thread 0] -> Linux CPU 1 - - b) Two threads per core - - [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 - -> [thread 1] -> Linux CPU 1 - -> [core 1] -> [thread 0] -> Linux CPU 2 - -> [thread 1] -> Linux CPU 3 - - Alternative enumeration: - - [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 - -> [thread 1] -> Linux CPU 2 - -> [core 1] -> [thread 0] -> Linux CPU 1 - -> [thread 1] -> Linux CPU 3 - - AMD nomenclature for CMT systems: - - [node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0 - -> [Compute Unit Core 1] -> Linux CPU 1 - -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 2 - -> [Compute Unit Core 1] -> Linux CPU 3 - -4) Dual Package, Dual Core - - a) One thread per core - - [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 - -> [core 1] -> [thread 0] -> Linux CPU 1 - - [package 1] -> [core 0] -> [thread 0] -> Linux CPU 2 - -> [core 1] -> [thread 0] -> Linux CPU 3 - - b) Two threads per core - - [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 - -> [thread 1] -> Linux CPU 1 - -> [core 1] -> [thread 0] -> Linux CPU 2 - -> [thread 1] -> Linux CPU 3 - - [package 1] -> [core 0] -> [thread 0] -> Linux CPU 4 - -> [thread 1] -> Linux CPU 5 - -> [core 1] -> [thread 0] -> Linux CPU 6 - -> [thread 1] -> Linux CPU 7 - - Alternative enumeration: - - [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 - -> [thread 1] -> Linux CPU 4 - -> [core 1] -> [thread 0] -> Linux CPU 1 - -> [thread 1] -> Linux CPU 5 - - [package 1] -> [core 0] -> [thread 0] -> Linux CPU 2 - -> [thread 1] -> Linux CPU 6 - -> [core 1] -> [thread 0] -> Linux CPU 3 - -> [thread 1] -> Linux CPU 7 - - AMD nomenclature for CMT systems: - - [node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0 - -> [Compute Unit Core 1] -> Linux CPU 1 - -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 2 - -> [Compute Unit Core 1] -> Linux CPU 3 - - [node 1] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 4 - -> [Compute Unit Core 1] -> Linux CPU 5 - -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 6 - -> [Compute Unit Core 1] -> Linux CPU 7 -- cgit v1.2.3 From 06955392a95cc2918c7881eddabe30e5d8cd0a53 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:18 +0800 Subject: Documentation: x86: convert exception-tables.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/exception-tables.rst | 346 +++++++++++++++++++++++++++++++++ Documentation/x86/exception-tables.txt | 327 ------------------------------- Documentation/x86/index.rst | 1 + 3 files changed, 347 insertions(+), 327 deletions(-) create mode 100644 Documentation/x86/exception-tables.rst delete mode 100644 Documentation/x86/exception-tables.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/exception-tables.rst b/Documentation/x86/exception-tables.rst new file mode 100644 index 000000000000..24596c8210b5 --- /dev/null +++ b/Documentation/x86/exception-tables.rst @@ -0,0 +1,346 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================== +Kernel level exception handling +=============================== + +Commentary by Joerg Pommnitz + +When a process runs in kernel mode, it often has to access user +mode memory whose address has been passed by an untrusted program. +To protect itself the kernel has to verify this address. + +In older versions of Linux this was done with the +int verify_area(int type, const void * addr, unsigned long size) +function (which has since been replaced by access_ok()). + +This function verified that the memory area starting at address +'addr' and of size 'size' was accessible for the operation specified +in type (read or write). To do this, verify_read had to look up the +virtual memory area (vma) that contained the address addr. In the +normal case (correctly working program), this test was successful. +It only failed for a few buggy programs. In some kernel profiling +tests, this normally unneeded verification used up a considerable +amount of time. + +To overcome this situation, Linus decided to let the virtual memory +hardware present in every Linux-capable CPU handle this test. + +How does this work? + +Whenever the kernel tries to access an address that is currently not +accessible, the CPU generates a page fault exception and calls the +page fault handler:: + + void do_page_fault(struct pt_regs *regs, unsigned long error_code) + +in arch/x86/mm/fault.c. The parameters on the stack are set up by +the low level assembly glue in arch/x86/kernel/entry_32.S. The parameter +regs is a pointer to the saved registers on the stack, error_code +contains a reason code for the exception. + +do_page_fault first obtains the unaccessible address from the CPU +control register CR2. If the address is within the virtual address +space of the process, the fault probably occurred, because the page +was not swapped in, write protected or something similar. However, +we are interested in the other case: the address is not valid, there +is no vma that contains this address. In this case, the kernel jumps +to the bad_area label. + +There it uses the address of the instruction that caused the exception +(i.e. regs->eip) to find an address where the execution can continue +(fixup). If this search is successful, the fault handler modifies the +return address (again regs->eip) and returns. The execution will +continue at the address in fixup. + +Where does fixup point to? + +Since we jump to the contents of fixup, fixup obviously points +to executable code. This code is hidden inside the user access macros. +I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h +as an example. The definition is somewhat hard to follow, so let's peek at +the code generated by the preprocessor and the compiler. I selected +the get_user call in drivers/char/sysrq.c for a detailed examination. + +The original code in sysrq.c line 587:: + + get_user(c, buf); + +The preprocessor output (edited to become somewhat readable):: + + ( + { + long __gu_err = - 14 , __gu_val = 0; + const __typeof__(*( ( buf ) )) *__gu_addr = ((buf)); + if (((((0 + current_set[0])->tss.segment) == 0x18 ) || + (((sizeof(*(buf))) <= 0xC0000000UL) && + ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf))))))) + do { + __gu_err = 0; + switch ((sizeof(*(buf)))) { + case 1: + __asm__ __volatile__( + "1: mov" "b" " %2,%" "b" "1\n" + "2:\n" + ".section .fixup,\"ax\"\n" + "3: movl %3,%0\n" + " xor" "b" " %" "b" "1,%" "b" "1\n" + " jmp 2b\n" + ".section __ex_table,\"a\"\n" + " .align 4\n" + " .long 1b,3b\n" + ".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *) + ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ; + break; + case 2: + __asm__ __volatile__( + "1: mov" "w" " %2,%" "w" "1\n" + "2:\n" + ".section .fixup,\"ax\"\n" + "3: movl %3,%0\n" + " xor" "w" " %" "w" "1,%" "w" "1\n" + " jmp 2b\n" + ".section __ex_table,\"a\"\n" + " .align 4\n" + " .long 1b,3b\n" + ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) + ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )); + break; + case 4: + __asm__ __volatile__( + "1: mov" "l" " %2,%" "" "1\n" + "2:\n" + ".section .fixup,\"ax\"\n" + "3: movl %3,%0\n" + " xor" "l" " %" "" "1,%" "" "1\n" + " jmp 2b\n" + ".section __ex_table,\"a\"\n" + " .align 4\n" " .long 1b,3b\n" + ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) + ( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err)); + break; + default: + (__gu_val) = __get_user_bad(); + } + } while (0) ; + ((c)) = (__typeof__(*((buf))))__gu_val; + __gu_err; + } + ); + +WOW! Black GCC/assembly magic. This is impossible to follow, so let's +see what code gcc generates:: + + > xorl %edx,%edx + > movl current_set,%eax + > cmpl $24,788(%eax) + > je .L1424 + > cmpl $-1073741825,64(%esp) + > ja .L1423 + > .L1424: + > movl %edx,%eax + > movl 64(%esp),%ebx + > #APP + > 1: movb (%ebx),%dl /* this is the actual user access */ + > 2: + > .section .fixup,"ax" + > 3: movl $-14,%eax + > xorb %dl,%dl + > jmp 2b + > .section __ex_table,"a" + > .align 4 + > .long 1b,3b + > .text + > #NO_APP + > .L1423: + > movzbl %dl,%esi + +The optimizer does a good job and gives us something we can actually +understand. Can we? The actual user access is quite obvious. Thanks +to the unified address space we can just access the address in user +memory. But what does the .section stuff do????? + +To understand this we have to look at the final kernel:: + + > objdump --section-headers vmlinux + > + > vmlinux: file format elf32-i386 + > + > Sections: + > Idx Name Size VMA LMA File off Algn + > 0 .text 00098f40 c0100000 c0100000 00001000 2**4 + > CONTENTS, ALLOC, LOAD, READONLY, CODE + > 1 .fixup 000016bc c0198f40 c0198f40 00099f40 2**0 + > CONTENTS, ALLOC, LOAD, READONLY, CODE + > 2 .rodata 0000f127 c019a5fc c019a5fc 0009b5fc 2**2 + > CONTENTS, ALLOC, LOAD, READONLY, DATA + > 3 __ex_table 000015c0 c01a9724 c01a9724 000aa724 2**2 + > CONTENTS, ALLOC, LOAD, READONLY, DATA + > 4 .data 0000ea58 c01abcf0 c01abcf0 000abcf0 2**4 + > CONTENTS, ALLOC, LOAD, DATA + > 5 .bss 00018e21 c01ba748 c01ba748 000ba748 2**2 + > ALLOC + > 6 .comment 00000ec4 00000000 00000000 000ba748 2**0 + > CONTENTS, READONLY + > 7 .note 00001068 00000ec4 00000ec4 000bb60c 2**0 + > CONTENTS, READONLY + +There are obviously 2 non standard ELF sections in the generated object +file. But first we want to find out what happened to our code in the +final kernel executable:: + + > objdump --disassemble --section=.text vmlinux + > + > c017e785 xorl %edx,%edx + > c017e787 movl 0xc01c7bec,%eax + > c017e78c cmpl $0x18,0x314(%eax) + > c017e793 je c017e79f + > c017e795 cmpl $0xbfffffff,0x40(%esp,1) + > c017e79d ja c017e7a7 + > c017e79f movl %edx,%eax + > c017e7a1 movl 0x40(%esp,1),%ebx + > c017e7a5 movb (%ebx),%dl + > c017e7a7 movzbl %dl,%esi + +The whole user memory access is reduced to 10 x86 machine instructions. +The instructions bracketed in the .section directives are no longer +in the normal execution path. They are located in a different section +of the executable file:: + + > objdump --disassemble --section=.fixup vmlinux + > + > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax + > c0199ffa <.fixup+10ba> xorb %dl,%dl + > c0199ffc <.fixup+10bc> jmp c017e7a7 + +And finally:: + + > objdump --full-contents --section=__ex_table vmlinux + > + > c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0 ................ + > c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0 ................ + > c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0 ................ + +or in human readable byte order:: + + > c01aa7c4 c017c093 c0199fe0 c017c097 c017c099 ................ + > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................ + ^^^^^^^^^^^^^^^^^ + this is the interesting part! + > c01aa7e4 c0180a08 c019a001 c0180a0a c019a004 ................ + +What happened? The assembly directives:: + + .section .fixup,"ax" + .section __ex_table,"a" + +told the assembler to move the following code to the specified +sections in the ELF object file. So the instructions:: + + 3: movl $-14,%eax + xorb %dl,%dl + jmp 2b + +ended up in the .fixup section of the object file and the addresses:: + + .long 1b,3b + +ended up in the __ex_table section of the object file. 1b and 3b +are local labels. The local label 1b (1b stands for next label 1 +backward) is the address of the instruction that might fault, i.e. +in our case the address of the label 1 is c017e7a5: +the original assembly code: > 1: movb (%ebx),%dl +and linked in vmlinux : > c017e7a5 movb (%ebx),%dl + +The local label 3 (backwards again) is the address of the code to handle +the fault, in our case the actual value is c0199ff5: +the original assembly code: > 3: movl $-14,%eax +and linked in vmlinux : > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax + +The assembly code:: + + > .section __ex_table,"a" + > .align 4 + > .long 1b,3b + +becomes the value pair:: + + > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................ + ^this is ^this is + 1b 3b + +c017e7a5,c0199ff5 in the exception table of the kernel. + +So, what actually happens if a fault from kernel mode with no suitable +vma occurs? + +#. access to invalid address:: + + > c017e7a5 movb (%ebx),%dl +#. MMU generates exception +#. CPU calls do_page_fault +#. do page fault calls search_exception_table (regs->eip == c017e7a5); +#. search_exception_table looks up the address c017e7a5 in the + exception table (i.e. the contents of the ELF section __ex_table) + and returns the address of the associated fault handle code c0199ff5. +#. do_page_fault modifies its own return address to point to the fault + handle code and returns. +#. execution continues in the fault handling code. +#. a) EAX becomes -EFAULT (== -14) + b) DL becomes zero (the value we "read" from user space) + c) execution continues at local label 2 (address of the + instruction immediately after the faulting user access). + +The steps 8a to 8c in a certain way emulate the faulting instruction. + +That's it, mostly. If you look at our example, you might ask why +we set EAX to -EFAULT in the exception handler code. Well, the +get_user macro actually returns a value: 0, if the user access was +successful, -EFAULT on failure. Our original code did not test this +return value, however the inline assembly code in get_user tries to +return -EFAULT. GCC selected EAX to return this value. + +NOTE: +Due to the way that the exception table is built and needs to be ordered, +only use exceptions for code in the .text section. Any other section +will cause the exception table to not be sorted correctly, and the +exceptions will fail. + +Things changed when 64-bit support was added to x86 Linux. Rather than +double the size of the exception table by expanding the two entries +from 32-bits to 64 bits, a clever trick was used to store addresses +as relative offsets from the table itself. The assembly code changed +from:: + + .long 1b,3b + to: + .long (from) - . + .long (to) - . + +and the C-code that uses these values converts back to absolute addresses +like this:: + + ex_insn_addr(const struct exception_table_entry *x) + { + return (unsigned long)&x->insn + x->insn; + } + +In v4.6 the exception table entry was expanded with a new field "handler". +This is also 32-bits wide and contains a third relative function +pointer which points to one of: + +1) ``int ex_handler_default(const struct exception_table_entry *fixup)`` + This is legacy case that just jumps to the fixup code + +2) ``int ex_handler_fault(const struct exception_table_entry *fixup)`` + This case provides the fault number of the trap that occurred at + entry->insn. It is used to distinguish page faults from machine + check. + +3) ``int ex_handler_ext(const struct exception_table_entry *fixup)`` + This case is used for uaccess_err ... we need to set a flag + in the task structure. Before the handler functions existed this + case was handled by adding a large offset to the fixup to tag + it as special. + +More functions can easily be added. diff --git a/Documentation/x86/exception-tables.txt b/Documentation/x86/exception-tables.txt deleted file mode 100644 index e396bcd8d830..000000000000 --- a/Documentation/x86/exception-tables.txt +++ /dev/null @@ -1,327 +0,0 @@ - Kernel level exception handling in Linux - Commentary by Joerg Pommnitz - -When a process runs in kernel mode, it often has to access user -mode memory whose address has been passed by an untrusted program. -To protect itself the kernel has to verify this address. - -In older versions of Linux this was done with the -int verify_area(int type, const void * addr, unsigned long size) -function (which has since been replaced by access_ok()). - -This function verified that the memory area starting at address -'addr' and of size 'size' was accessible for the operation specified -in type (read or write). To do this, verify_read had to look up the -virtual memory area (vma) that contained the address addr. In the -normal case (correctly working program), this test was successful. -It only failed for a few buggy programs. In some kernel profiling -tests, this normally unneeded verification used up a considerable -amount of time. - -To overcome this situation, Linus decided to let the virtual memory -hardware present in every Linux-capable CPU handle this test. - -How does this work? - -Whenever the kernel tries to access an address that is currently not -accessible, the CPU generates a page fault exception and calls the -page fault handler - -void do_page_fault(struct pt_regs *regs, unsigned long error_code) - -in arch/x86/mm/fault.c. The parameters on the stack are set up by -the low level assembly glue in arch/x86/kernel/entry_32.S. The parameter -regs is a pointer to the saved registers on the stack, error_code -contains a reason code for the exception. - -do_page_fault first obtains the unaccessible address from the CPU -control register CR2. If the address is within the virtual address -space of the process, the fault probably occurred, because the page -was not swapped in, write protected or something similar. However, -we are interested in the other case: the address is not valid, there -is no vma that contains this address. In this case, the kernel jumps -to the bad_area label. - -There it uses the address of the instruction that caused the exception -(i.e. regs->eip) to find an address where the execution can continue -(fixup). If this search is successful, the fault handler modifies the -return address (again regs->eip) and returns. The execution will -continue at the address in fixup. - -Where does fixup point to? - -Since we jump to the contents of fixup, fixup obviously points -to executable code. This code is hidden inside the user access macros. -I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h -as an example. The definition is somewhat hard to follow, so let's peek at -the code generated by the preprocessor and the compiler. I selected -the get_user call in drivers/char/sysrq.c for a detailed examination. - -The original code in sysrq.c line 587: - get_user(c, buf); - -The preprocessor output (edited to become somewhat readable): - -( - { - long __gu_err = - 14 , __gu_val = 0; - const __typeof__(*( ( buf ) )) *__gu_addr = ((buf)); - if (((((0 + current_set[0])->tss.segment) == 0x18 ) || - (((sizeof(*(buf))) <= 0xC0000000UL) && - ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf))))))) - do { - __gu_err = 0; - switch ((sizeof(*(buf)))) { - case 1: - __asm__ __volatile__( - "1: mov" "b" " %2,%" "b" "1\n" - "2:\n" - ".section .fixup,\"ax\"\n" - "3: movl %3,%0\n" - " xor" "b" " %" "b" "1,%" "b" "1\n" - " jmp 2b\n" - ".section __ex_table,\"a\"\n" - " .align 4\n" - " .long 1b,3b\n" - ".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *) - ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ; - break; - case 2: - __asm__ __volatile__( - "1: mov" "w" " %2,%" "w" "1\n" - "2:\n" - ".section .fixup,\"ax\"\n" - "3: movl %3,%0\n" - " xor" "w" " %" "w" "1,%" "w" "1\n" - " jmp 2b\n" - ".section __ex_table,\"a\"\n" - " .align 4\n" - " .long 1b,3b\n" - ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) - ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )); - break; - case 4: - __asm__ __volatile__( - "1: mov" "l" " %2,%" "" "1\n" - "2:\n" - ".section .fixup,\"ax\"\n" - "3: movl %3,%0\n" - " xor" "l" " %" "" "1,%" "" "1\n" - " jmp 2b\n" - ".section __ex_table,\"a\"\n" - " .align 4\n" " .long 1b,3b\n" - ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) - ( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err)); - break; - default: - (__gu_val) = __get_user_bad(); - } - } while (0) ; - ((c)) = (__typeof__(*((buf))))__gu_val; - __gu_err; - } -); - -WOW! Black GCC/assembly magic. This is impossible to follow, so let's -see what code gcc generates: - - > xorl %edx,%edx - > movl current_set,%eax - > cmpl $24,788(%eax) - > je .L1424 - > cmpl $-1073741825,64(%esp) - > ja .L1423 - > .L1424: - > movl %edx,%eax - > movl 64(%esp),%ebx - > #APP - > 1: movb (%ebx),%dl /* this is the actual user access */ - > 2: - > .section .fixup,"ax" - > 3: movl $-14,%eax - > xorb %dl,%dl - > jmp 2b - > .section __ex_table,"a" - > .align 4 - > .long 1b,3b - > .text - > #NO_APP - > .L1423: - > movzbl %dl,%esi - -The optimizer does a good job and gives us something we can actually -understand. Can we? The actual user access is quite obvious. Thanks -to the unified address space we can just access the address in user -memory. But what does the .section stuff do????? - -To understand this we have to look at the final kernel: - - > objdump --section-headers vmlinux - > - > vmlinux: file format elf32-i386 - > - > Sections: - > Idx Name Size VMA LMA File off Algn - > 0 .text 00098f40 c0100000 c0100000 00001000 2**4 - > CONTENTS, ALLOC, LOAD, READONLY, CODE - > 1 .fixup 000016bc c0198f40 c0198f40 00099f40 2**0 - > CONTENTS, ALLOC, LOAD, READONLY, CODE - > 2 .rodata 0000f127 c019a5fc c019a5fc 0009b5fc 2**2 - > CONTENTS, ALLOC, LOAD, READONLY, DATA - > 3 __ex_table 000015c0 c01a9724 c01a9724 000aa724 2**2 - > CONTENTS, ALLOC, LOAD, READONLY, DATA - > 4 .data 0000ea58 c01abcf0 c01abcf0 000abcf0 2**4 - > CONTENTS, ALLOC, LOAD, DATA - > 5 .bss 00018e21 c01ba748 c01ba748 000ba748 2**2 - > ALLOC - > 6 .comment 00000ec4 00000000 00000000 000ba748 2**0 - > CONTENTS, READONLY - > 7 .note 00001068 00000ec4 00000ec4 000bb60c 2**0 - > CONTENTS, READONLY - -There are obviously 2 non standard ELF sections in the generated object -file. But first we want to find out what happened to our code in the -final kernel executable: - - > objdump --disassemble --section=.text vmlinux - > - > c017e785 xorl %edx,%edx - > c017e787 movl 0xc01c7bec,%eax - > c017e78c cmpl $0x18,0x314(%eax) - > c017e793 je c017e79f - > c017e795 cmpl $0xbfffffff,0x40(%esp,1) - > c017e79d ja c017e7a7 - > c017e79f movl %edx,%eax - > c017e7a1 movl 0x40(%esp,1),%ebx - > c017e7a5 movb (%ebx),%dl - > c017e7a7 movzbl %dl,%esi - -The whole user memory access is reduced to 10 x86 machine instructions. -The instructions bracketed in the .section directives are no longer -in the normal execution path. They are located in a different section -of the executable file: - - > objdump --disassemble --section=.fixup vmlinux - > - > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax - > c0199ffa <.fixup+10ba> xorb %dl,%dl - > c0199ffc <.fixup+10bc> jmp c017e7a7 - -And finally: - > objdump --full-contents --section=__ex_table vmlinux - > - > c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0 ................ - > c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0 ................ - > c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0 ................ - -or in human readable byte order: - - > c01aa7c4 c017c093 c0199fe0 c017c097 c017c099 ................ - > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................ - ^^^^^^^^^^^^^^^^^ - this is the interesting part! - > c01aa7e4 c0180a08 c019a001 c0180a0a c019a004 ................ - -What happened? The assembly directives - -.section .fixup,"ax" -.section __ex_table,"a" - -told the assembler to move the following code to the specified -sections in the ELF object file. So the instructions -3: movl $-14,%eax - xorb %dl,%dl - jmp 2b -ended up in the .fixup section of the object file and the addresses - .long 1b,3b -ended up in the __ex_table section of the object file. 1b and 3b -are local labels. The local label 1b (1b stands for next label 1 -backward) is the address of the instruction that might fault, i.e. -in our case the address of the label 1 is c017e7a5: -the original assembly code: > 1: movb (%ebx),%dl -and linked in vmlinux : > c017e7a5 movb (%ebx),%dl - -The local label 3 (backwards again) is the address of the code to handle -the fault, in our case the actual value is c0199ff5: -the original assembly code: > 3: movl $-14,%eax -and linked in vmlinux : > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax - -The assembly code - > .section __ex_table,"a" - > .align 4 - > .long 1b,3b - -becomes the value pair - > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................ - ^this is ^this is - 1b 3b -c017e7a5,c0199ff5 in the exception table of the kernel. - -So, what actually happens if a fault from kernel mode with no suitable -vma occurs? - -1.) access to invalid address: - > c017e7a5 movb (%ebx),%dl -2.) MMU generates exception -3.) CPU calls do_page_fault -4.) do page fault calls search_exception_table (regs->eip == c017e7a5); -5.) search_exception_table looks up the address c017e7a5 in the - exception table (i.e. the contents of the ELF section __ex_table) - and returns the address of the associated fault handle code c0199ff5. -6.) do_page_fault modifies its own return address to point to the fault - handle code and returns. -7.) execution continues in the fault handling code. -8.) 8a) EAX becomes -EFAULT (== -14) - 8b) DL becomes zero (the value we "read" from user space) - 8c) execution continues at local label 2 (address of the - instruction immediately after the faulting user access). - -The steps 8a to 8c in a certain way emulate the faulting instruction. - -That's it, mostly. If you look at our example, you might ask why -we set EAX to -EFAULT in the exception handler code. Well, the -get_user macro actually returns a value: 0, if the user access was -successful, -EFAULT on failure. Our original code did not test this -return value, however the inline assembly code in get_user tries to -return -EFAULT. GCC selected EAX to return this value. - -NOTE: -Due to the way that the exception table is built and needs to be ordered, -only use exceptions for code in the .text section. Any other section -will cause the exception table to not be sorted correctly, and the -exceptions will fail. - -Things changed when 64-bit support was added to x86 Linux. Rather than -double the size of the exception table by expanding the two entries -from 32-bits to 64 bits, a clever trick was used to store addresses -as relative offsets from the table itself. The assembly code changed -from: - .long 1b,3b -to: - .long (from) - . - .long (to) - . - -and the C-code that uses these values converts back to absolute addresses -like this: - - ex_insn_addr(const struct exception_table_entry *x) - { - return (unsigned long)&x->insn + x->insn; - } - -In v4.6 the exception table entry was expanded with a new field "handler". -This is also 32-bits wide and contains a third relative function -pointer which points to one of: - -1) int ex_handler_default(const struct exception_table_entry *fixup) - This is legacy case that just jumps to the fixup code -2) int ex_handler_fault(const struct exception_table_entry *fixup) - This case provides the fault number of the trap that occurred at - entry->insn. It is used to distinguish page faults from machine - check. -3) int ex_handler_ext(const struct exception_table_entry *fixup) - This case is used for uaccess_err ... we need to set a flag - in the task structure. Before the handler functions existed this - case was handled by adding a large offset to the fixup to tag - it as special. -More functions can easily be added. diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index da89bf0ad69f..c855b730bab4 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -10,3 +10,4 @@ x86-specific Documentation boot topology + exception-tables -- cgit v1.2.3 From ac2b4687dadd4e7703c0e985f5a5a0fdaab53316 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:19 +0800 Subject: Documentation: x86: convert kernel-stacks to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/index.rst | 1 + Documentation/x86/kernel-stacks | 141 ---------------------------------- Documentation/x86/kernel-stacks.rst | 147 ++++++++++++++++++++++++++++++++++++ 3 files changed, 148 insertions(+), 141 deletions(-) delete mode 100644 Documentation/x86/kernel-stacks create mode 100644 Documentation/x86/kernel-stacks.rst (limited to 'Documentation/x86') diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index c855b730bab4..f6f4e0fc79f2 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -11,3 +11,4 @@ x86-specific Documentation boot topology exception-tables + kernel-stacks diff --git a/Documentation/x86/kernel-stacks b/Documentation/x86/kernel-stacks deleted file mode 100644 index 9a0aa4d3a866..000000000000 --- a/Documentation/x86/kernel-stacks +++ /dev/null @@ -1,141 +0,0 @@ -Kernel stacks on x86-64 bit ---------------------------- - -Most of the text from Keith Owens, hacked by AK - -x86_64 page size (PAGE_SIZE) is 4K. - -Like all other architectures, x86_64 has a kernel stack for every -active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big. -These stacks contain useful data as long as a thread is alive or a -zombie. While the thread is in user space the kernel stack is empty -except for the thread_info structure at the bottom. - -In addition to the per thread stacks, there are specialized stacks -associated with each CPU. These stacks are only used while the kernel -is in control on that CPU; when a CPU returns to user space the -specialized stacks contain no useful data. The main CPU stacks are: - -* Interrupt stack. IRQ_STACK_SIZE - - Used for external hardware interrupts. If this is the first external - hardware interrupt (i.e. not a nested hardware interrupt) then the - kernel switches from the current task to the interrupt stack. Like - the split thread and interrupt stacks on i386, this gives more room - for kernel interrupt processing without having to increase the size - of every per thread stack. - - The interrupt stack is also used when processing a softirq. - -Switching to the kernel interrupt stack is done by software based on a -per CPU interrupt nest counter. This is needed because x86-64 "IST" -hardware stacks cannot nest without races. - -x86_64 also has a feature which is not available on i386, the ability -to automatically switch to a new stack for designated events such as -double fault or NMI, which makes it easier to handle these unusual -events on x86_64. This feature is called the Interrupt Stack Table -(IST). There can be up to 7 IST entries per CPU. The IST code is an -index into the Task State Segment (TSS). The IST entries in the TSS -point to dedicated stacks; each stack can be a different size. - -An IST is selected by a non-zero value in the IST field of an -interrupt-gate descriptor. When an interrupt occurs and the hardware -loads such a descriptor, the hardware automatically sets the new stack -pointer based on the IST value, then invokes the interrupt handler. If -the interrupt came from user mode, then the interrupt handler prologue -will switch back to the per-thread stack. If software wants to allow -nested IST interrupts then the handler must adjust the IST values on -entry to and exit from the interrupt handler. (This is occasionally -done, e.g. for debug exceptions.) - -Events with different IST codes (i.e. with different stacks) can be -nested. For example, a debug interrupt can safely be interrupted by an -NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack -pointers on entry to and exit from all IST events, in theory allowing -IST events with the same code to be nested. However in most cases, the -stack size allocated to an IST assumes no nesting for the same code. -If that assumption is ever broken then the stacks will become corrupt. - -The currently assigned IST stacks are :- - -* DOUBLEFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). - - Used for interrupt 8 - Double Fault Exception (#DF). - - Invoked when handling one exception causes another exception. Happens - when the kernel is very confused (e.g. kernel stack pointer corrupt). - Using a separate stack allows the kernel to recover from it well enough - in many cases to still output an oops. - -* NMI_STACK. EXCEPTION_STKSZ (PAGE_SIZE). - - Used for non-maskable interrupts (NMI). - - NMI can be delivered at any time, including when the kernel is in the - middle of switching stacks. Using IST for NMI events avoids making - assumptions about the previous state of the kernel stack. - -* DEBUG_STACK. DEBUG_STKSZ - - Used for hardware debug interrupts (interrupt 1) and for software - debug interrupts (INT3). - - When debugging a kernel, debug interrupts (both hardware and - software) can occur at any time. Using IST for these interrupts - avoids making assumptions about the previous state of the kernel - stack. - -* MCE_STACK. EXCEPTION_STKSZ (PAGE_SIZE). - - Used for interrupt 18 - Machine Check Exception (#MC). - - MCE can be delivered at any time, including when the kernel is in the - middle of switching stacks. Using IST for MCE events avoids making - assumptions about the previous state of the kernel stack. - -For more details see the Intel IA32 or AMD AMD64 architecture manuals. - - -Printing backtraces on x86 --------------------------- - -The question about the '?' preceding function names in an x86 stacktrace -keeps popping up, here's an indepth explanation. It helps if the reader -stares at print_context_stack() and the whole machinery in and around -arch/x86/kernel/dumpstack.c. - -Adapted from Ingo's mail, Message-ID: <20150521101614.GA10889@gmail.com>: - -We always scan the full kernel stack for return addresses stored on -the kernel stack(s) [*], from stack top to stack bottom, and print out -anything that 'looks like' a kernel text address. - -If it fits into the frame pointer chain, we print it without a question -mark, knowing that it's part of the real backtrace. - -If the address does not fit into our expected frame pointer chain we -still print it, but we print a '?'. It can mean two things: - - - either the address is not part of the call chain: it's just stale - values on the kernel stack, from earlier function calls. This is - the common case. - - - or it is part of the call chain, but the frame pointer was not set - up properly within the function, so we don't recognize it. - -This way we will always print out the real call chain (plus a few more -entries), regardless of whether the frame pointer was set up correctly -or not - but in most cases we'll get the call chain right as well. The -entries printed are strictly in stack order, so you can deduce more -information from that as well. - -The most important property of this method is that we _never_ lose -information: we always strive to print _all_ addresses on the stack(s) -that look like kernel text addresses, so if debug information is wrong, -we still print out the real call chain as well - just with more question -marks than ideal. - -[*] For things like IRQ and IST stacks, we also scan those stacks, in - the right order, and try to cross from one stack into another - reconstructing the call chain. This works most of the time. diff --git a/Documentation/x86/kernel-stacks.rst b/Documentation/x86/kernel-stacks.rst new file mode 100644 index 000000000000..c7c7afce086f --- /dev/null +++ b/Documentation/x86/kernel-stacks.rst @@ -0,0 +1,147 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Kernel Stacks +============= + +Kernel stacks on x86-64 bit +=========================== + +Most of the text from Keith Owens, hacked by AK + +x86_64 page size (PAGE_SIZE) is 4K. + +Like all other architectures, x86_64 has a kernel stack for every +active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big. +These stacks contain useful data as long as a thread is alive or a +zombie. While the thread is in user space the kernel stack is empty +except for the thread_info structure at the bottom. + +In addition to the per thread stacks, there are specialized stacks +associated with each CPU. These stacks are only used while the kernel +is in control on that CPU; when a CPU returns to user space the +specialized stacks contain no useful data. The main CPU stacks are: + +* Interrupt stack. IRQ_STACK_SIZE + + Used for external hardware interrupts. If this is the first external + hardware interrupt (i.e. not a nested hardware interrupt) then the + kernel switches from the current task to the interrupt stack. Like + the split thread and interrupt stacks on i386, this gives more room + for kernel interrupt processing without having to increase the size + of every per thread stack. + + The interrupt stack is also used when processing a softirq. + +Switching to the kernel interrupt stack is done by software based on a +per CPU interrupt nest counter. This is needed because x86-64 "IST" +hardware stacks cannot nest without races. + +x86_64 also has a feature which is not available on i386, the ability +to automatically switch to a new stack for designated events such as +double fault or NMI, which makes it easier to handle these unusual +events on x86_64. This feature is called the Interrupt Stack Table +(IST). There can be up to 7 IST entries per CPU. The IST code is an +index into the Task State Segment (TSS). The IST entries in the TSS +point to dedicated stacks; each stack can be a different size. + +An IST is selected by a non-zero value in the IST field of an +interrupt-gate descriptor. When an interrupt occurs and the hardware +loads such a descriptor, the hardware automatically sets the new stack +pointer based on the IST value, then invokes the interrupt handler. If +the interrupt came from user mode, then the interrupt handler prologue +will switch back to the per-thread stack. If software wants to allow +nested IST interrupts then the handler must adjust the IST values on +entry to and exit from the interrupt handler. (This is occasionally +done, e.g. for debug exceptions.) + +Events with different IST codes (i.e. with different stacks) can be +nested. For example, a debug interrupt can safely be interrupted by an +NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack +pointers on entry to and exit from all IST events, in theory allowing +IST events with the same code to be nested. However in most cases, the +stack size allocated to an IST assumes no nesting for the same code. +If that assumption is ever broken then the stacks will become corrupt. + +The currently assigned IST stacks are: + +* DOUBLEFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for interrupt 8 - Double Fault Exception (#DF). + + Invoked when handling one exception causes another exception. Happens + when the kernel is very confused (e.g. kernel stack pointer corrupt). + Using a separate stack allows the kernel to recover from it well enough + in many cases to still output an oops. + +* NMI_STACK. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for non-maskable interrupts (NMI). + + NMI can be delivered at any time, including when the kernel is in the + middle of switching stacks. Using IST for NMI events avoids making + assumptions about the previous state of the kernel stack. + +* DEBUG_STACK. DEBUG_STKSZ + + Used for hardware debug interrupts (interrupt 1) and for software + debug interrupts (INT3). + + When debugging a kernel, debug interrupts (both hardware and + software) can occur at any time. Using IST for these interrupts + avoids making assumptions about the previous state of the kernel + stack. + +* MCE_STACK. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for interrupt 18 - Machine Check Exception (#MC). + + MCE can be delivered at any time, including when the kernel is in the + middle of switching stacks. Using IST for MCE events avoids making + assumptions about the previous state of the kernel stack. + +For more details see the Intel IA32 or AMD AMD64 architecture manuals. + + +Printing backtraces on x86 +========================== + +The question about the '?' preceding function names in an x86 stacktrace +keeps popping up, here's an indepth explanation. It helps if the reader +stares at print_context_stack() and the whole machinery in and around +arch/x86/kernel/dumpstack.c. + +Adapted from Ingo's mail, Message-ID: <20150521101614.GA10889@gmail.com>: + +We always scan the full kernel stack for return addresses stored on +the kernel stack(s) [1]_, from stack top to stack bottom, and print out +anything that 'looks like' a kernel text address. + +If it fits into the frame pointer chain, we print it without a question +mark, knowing that it's part of the real backtrace. + +If the address does not fit into our expected frame pointer chain we +still print it, but we print a '?'. It can mean two things: + + - either the address is not part of the call chain: it's just stale + values on the kernel stack, from earlier function calls. This is + the common case. + + - or it is part of the call chain, but the frame pointer was not set + up properly within the function, so we don't recognize it. + +This way we will always print out the real call chain (plus a few more +entries), regardless of whether the frame pointer was set up correctly +or not - but in most cases we'll get the call chain right as well. The +entries printed are strictly in stack order, so you can deduce more +information from that as well. + +The most important property of this method is that we _never_ lose +information: we always strive to print _all_ addresses on the stack(s) +that look like kernel text addresses, so if debug information is wrong, +we still print out the real call chain as well - just with more question +marks than ideal. + +.. [1] For things like IRQ and IST stacks, we also scan those stacks, in + the right order, and try to cross from one stack into another + reconstructing the call chain. This works most of the time. -- cgit v1.2.3 From c2dea5cda0729fdd91760cdad6bb1166037be74a Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:20 +0800 Subject: Documentation: x86: convert entry_64.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/entry_64.rst | 110 +++++++++++++++++++++++++++++++++++++++++ Documentation/x86/entry_64.txt | 104 -------------------------------------- Documentation/x86/index.rst | 1 + 3 files changed, 111 insertions(+), 104 deletions(-) create mode 100644 Documentation/x86/entry_64.rst delete mode 100644 Documentation/x86/entry_64.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/entry_64.rst b/Documentation/x86/entry_64.rst new file mode 100644 index 000000000000..a48b3f6ebbe8 --- /dev/null +++ b/Documentation/x86/entry_64.rst @@ -0,0 +1,110 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============== +Kernel Entries +============== + +This file documents some of the kernel entries in +arch/x86/entry/entry_64.S. A lot of this explanation is adapted from +an email from Ingo Molnar: + +http://lkml.kernel.org/r/<20110529191055.GC9835%40elte.hu> + +The x86 architecture has quite a few different ways to jump into +kernel code. Most of these entry points are registered in +arch/x86/kernel/traps.c and implemented in arch/x86/entry/entry_64.S +for 64-bit, arch/x86/entry/entry_32.S for 32-bit and finally +arch/x86/entry/entry_64_compat.S which implements the 32-bit compatibility +syscall entry points and thus provides for 32-bit processes the +ability to execute syscalls when running on 64-bit kernels. + +The IDT vector assignments are listed in arch/x86/include/asm/irq_vectors.h. + +Some of these entries are: + + - system_call: syscall instruction from 64-bit code. + + - entry_INT80_compat: int 0x80 from 32-bit or 64-bit code; compat syscall + either way. + + - entry_INT80_compat, ia32_sysenter: syscall and sysenter from 32-bit + code + + - interrupt: An array of entries. Every IDT vector that doesn't + explicitly point somewhere else gets set to the corresponding + value in interrupts. These point to a whole array of + magically-generated functions that make their way to do_IRQ with + the interrupt number as a parameter. + + - APIC interrupts: Various special-purpose interrupts for things + like TLB shootdown. + + - Architecturally-defined exceptions like divide_error. + +There are a few complexities here. The different x86-64 entries +have different calling conventions. The syscall and sysenter +instructions have their own peculiar calling conventions. Some of +the IDT entries push an error code onto the stack; others don't. +IDT entries using the IST alternative stack mechanism need their own +magic to get the stack frames right. (You can find some +documentation in the AMD APM, Volume 2, Chapter 8 and the Intel SDM, +Volume 3, Chapter 6.) + +Dealing with the swapgs instruction is especially tricky. Swapgs +toggles whether gs is the kernel gs or the user gs. The swapgs +instruction is rather fragile: it must nest perfectly and only in +single depth, it should only be used if entering from user mode to +kernel mode and then when returning to user-space, and precisely +so. If we mess that up even slightly, we crash. + +So when we have a secondary entry, already in kernel mode, we *must +not* use SWAPGS blindly - nor must we forget doing a SWAPGS when it's +not switched/swapped yet. + +Now, there's a secondary complication: there's a cheap way to test +which mode the CPU is in and an expensive way. + +The cheap way is to pick this info off the entry frame on the kernel +stack, from the CS of the ptregs area of the kernel stack:: + + xorl %ebx,%ebx + testl $3,CS+8(%rsp) + je error_kernelspace + SWAPGS + +The expensive (paranoid) way is to read back the MSR_GS_BASE value +(which is what SWAPGS modifies):: + + movl $1,%ebx + movl $MSR_GS_BASE,%ecx + rdmsr + testl %edx,%edx + js 1f /* negative -> in kernel */ + SWAPGS + xorl %ebx,%ebx + 1: ret + +If we are at an interrupt or user-trap/gate-alike boundary then we can +use the faster check: the stack will be a reliable indicator of +whether SWAPGS was already done: if we see that we are a secondary +entry interrupting kernel mode execution, then we know that the GS +base has already been switched. If it says that we interrupted +user-space execution then we must do the SWAPGS. + +But if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context, +which might have triggered right after a normal entry wrote CS to the +stack but before we executed SWAPGS, then the only safe way to check +for GS is the slower method: the RDMSR. + +Therefore, super-atomic entries (except NMI, which is handled separately) +must use idtentry with paranoid=1 to handle gsbase correctly. This +triggers three main behavior changes: + + - Interrupt entry will use the slower gsbase check. + - Interrupt entry from user mode will switch off the IST stack. + - Interrupt exit to kernel mode will not attempt to reschedule. + +We try to only use IST entries and the paranoid entry code for vectors +that absolutely need the more expensive check for the GS base - and we +generate all 'normal' entry points with the regular (faster) paranoid=0 +variant. diff --git a/Documentation/x86/entry_64.txt b/Documentation/x86/entry_64.txt deleted file mode 100644 index c1df8eba9dfd..000000000000 --- a/Documentation/x86/entry_64.txt +++ /dev/null @@ -1,104 +0,0 @@ -This file documents some of the kernel entries in -arch/x86/entry/entry_64.S. A lot of this explanation is adapted from -an email from Ingo Molnar: - -http://lkml.kernel.org/r/<20110529191055.GC9835%40elte.hu> - -The x86 architecture has quite a few different ways to jump into -kernel code. Most of these entry points are registered in -arch/x86/kernel/traps.c and implemented in arch/x86/entry/entry_64.S -for 64-bit, arch/x86/entry/entry_32.S for 32-bit and finally -arch/x86/entry/entry_64_compat.S which implements the 32-bit compatibility -syscall entry points and thus provides for 32-bit processes the -ability to execute syscalls when running on 64-bit kernels. - -The IDT vector assignments are listed in arch/x86/include/asm/irq_vectors.h. - -Some of these entries are: - - - system_call: syscall instruction from 64-bit code. - - - entry_INT80_compat: int 0x80 from 32-bit or 64-bit code; compat syscall - either way. - - - entry_INT80_compat, ia32_sysenter: syscall and sysenter from 32-bit - code - - - interrupt: An array of entries. Every IDT vector that doesn't - explicitly point somewhere else gets set to the corresponding - value in interrupts. These point to a whole array of - magically-generated functions that make their way to do_IRQ with - the interrupt number as a parameter. - - - APIC interrupts: Various special-purpose interrupts for things - like TLB shootdown. - - - Architecturally-defined exceptions like divide_error. - -There are a few complexities here. The different x86-64 entries -have different calling conventions. The syscall and sysenter -instructions have their own peculiar calling conventions. Some of -the IDT entries push an error code onto the stack; others don't. -IDT entries using the IST alternative stack mechanism need their own -magic to get the stack frames right. (You can find some -documentation in the AMD APM, Volume 2, Chapter 8 and the Intel SDM, -Volume 3, Chapter 6.) - -Dealing with the swapgs instruction is especially tricky. Swapgs -toggles whether gs is the kernel gs or the user gs. The swapgs -instruction is rather fragile: it must nest perfectly and only in -single depth, it should only be used if entering from user mode to -kernel mode and then when returning to user-space, and precisely -so. If we mess that up even slightly, we crash. - -So when we have a secondary entry, already in kernel mode, we *must -not* use SWAPGS blindly - nor must we forget doing a SWAPGS when it's -not switched/swapped yet. - -Now, there's a secondary complication: there's a cheap way to test -which mode the CPU is in and an expensive way. - -The cheap way is to pick this info off the entry frame on the kernel -stack, from the CS of the ptregs area of the kernel stack: - - xorl %ebx,%ebx - testl $3,CS+8(%rsp) - je error_kernelspace - SWAPGS - -The expensive (paranoid) way is to read back the MSR_GS_BASE value -(which is what SWAPGS modifies): - - movl $1,%ebx - movl $MSR_GS_BASE,%ecx - rdmsr - testl %edx,%edx - js 1f /* negative -> in kernel */ - SWAPGS - xorl %ebx,%ebx -1: ret - -If we are at an interrupt or user-trap/gate-alike boundary then we can -use the faster check: the stack will be a reliable indicator of -whether SWAPGS was already done: if we see that we are a secondary -entry interrupting kernel mode execution, then we know that the GS -base has already been switched. If it says that we interrupted -user-space execution then we must do the SWAPGS. - -But if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context, -which might have triggered right after a normal entry wrote CS to the -stack but before we executed SWAPGS, then the only safe way to check -for GS is the slower method: the RDMSR. - -Therefore, super-atomic entries (except NMI, which is handled separately) -must use idtentry with paranoid=1 to handle gsbase correctly. This -triggers three main behavior changes: - - - Interrupt entry will use the slower gsbase check. - - Interrupt entry from user mode will switch off the IST stack. - - Interrupt exit to kernel mode will not attempt to reschedule. - -We try to only use IST entries and the paranoid entry code for vectors -that absolutely need the more expensive check for the GS base - and we -generate all 'normal' entry points with the regular (faster) paranoid=0 -variant. diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index f6f4e0fc79f2..0e3e73458738 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -12,3 +12,4 @@ x86-specific Documentation topology exception-tables kernel-stacks + entry_64 -- cgit v1.2.3 From 4b1357600200cd2b56ecc6fb452d50d4079b42ed Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:21 +0800 Subject: Documentation: x86: convert earlyprintk.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/earlyprintk.rst | 151 ++++++++++++++++++++++++++++++++++++++ Documentation/x86/earlyprintk.txt | 141 ----------------------------------- Documentation/x86/index.rst | 1 + 3 files changed, 152 insertions(+), 141 deletions(-) create mode 100644 Documentation/x86/earlyprintk.rst delete mode 100644 Documentation/x86/earlyprintk.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/earlyprintk.rst b/Documentation/x86/earlyprintk.rst new file mode 100644 index 000000000000..11307378acf0 --- /dev/null +++ b/Documentation/x86/earlyprintk.rst @@ -0,0 +1,151 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============ +Early Printk +============ + +Mini-HOWTO for using the earlyprintk=dbgp boot option with a +USB2 Debug port key and a debug cable, on x86 systems. + +You need two computers, the 'USB debug key' special gadget and +and two USB cables, connected like this:: + + [host/target] <-------> [USB debug key] <-------> [client/console] + +Hardware requirements +===================== + + a) Host/target system needs to have USB debug port capability. + + You can check this capability by looking at a 'Debug port' bit in + the lspci -vvv output:: + + # lspci -vvv + ... + 00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 03) (prog-if 20 [EHCI]) + Subsystem: Lenovo ThinkPad T61 + Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx- + Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- /grub.cfg. + + On systems with more than one EHCI debug controller you must + specify the correct EHCI debug controller number. The ordering + comes from the PCI bus enumeration of the EHCI controllers. The + default with no number argument is "0" or the first EHCI debug + controller. To use the second EHCI debug controller, you would + use the command line: "earlyprintk=dbgp1" + + .. note:: + normally earlyprintk console gets turned off once the + regular console is alive - use "earlyprintk=dbgp,keep" to keep + this channel open beyond early bootup. This can be useful for + debugging crashes under Xorg, etc. + + b) On the client/console system: + + You should enable the following kernel config option:: + + CONFIG_USB_SERIAL_DEBUG=y + + On the next bootup with the modified kernel you should + get a /dev/ttyUSBx device(s). + + Now this channel of kernel messages is ready to be used: start + your favorite terminal emulator (minicom, etc.) and set + it up to use /dev/ttyUSB0 - or use a raw 'cat /dev/ttyUSBx' to + see the raw output. + + c) On Nvidia Southbridge based systems: the kernel will try to probe + and find out which port has a debug device connected. + +Testing +======= + +You can test the output by using earlyprintk=dbgp,keep and provoking +kernel messages on the host/target system. You can provoke a harmless +kernel message by for example doing:: + + echo h > /proc/sysrq-trigger + +On the host/target system you should see this help line in "dmesg" output:: + + SysRq : HELP : loglevel(0-9) reBoot Crashdump terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z) + +On the client/console system do:: + + cat /dev/ttyUSB0 + +And you should see the help line above displayed shortly after you've +provoked it on the host system. + +If it does not work then please ask about it on the linux-kernel@vger.kernel.org +mailing list or contact the x86 maintainers. diff --git a/Documentation/x86/earlyprintk.txt b/Documentation/x86/earlyprintk.txt deleted file mode 100644 index 46933e06c972..000000000000 --- a/Documentation/x86/earlyprintk.txt +++ /dev/null @@ -1,141 +0,0 @@ - -Mini-HOWTO for using the earlyprintk=dbgp boot option with a -USB2 Debug port key and a debug cable, on x86 systems. - -You need two computers, the 'USB debug key' special gadget and -and two USB cables, connected like this: - - [host/target] <-------> [USB debug key] <-------> [client/console] - -1. There are a number of specific hardware requirements: - - a.) Host/target system needs to have USB debug port capability. - - You can check this capability by looking at a 'Debug port' bit in - the lspci -vvv output: - - # lspci -vvv - ... - 00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 03) (prog-if 20 [EHCI]) - Subsystem: Lenovo ThinkPad T61 - Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx- - Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- /grub.cfg.) - - On systems with more than one EHCI debug controller you must - specify the correct EHCI debug controller number. The ordering - comes from the PCI bus enumeration of the EHCI controllers. The - default with no number argument is "0" or the first EHCI debug - controller. To use the second EHCI debug controller, you would - use the command line: "earlyprintk=dbgp1" - - NOTE: normally earlyprintk console gets turned off once the - regular console is alive - use "earlyprintk=dbgp,keep" to keep - this channel open beyond early bootup. This can be useful for - debugging crashes under Xorg, etc. - - b.) On the client/console system: - - You should enable the following kernel config option: - - CONFIG_USB_SERIAL_DEBUG=y - - On the next bootup with the modified kernel you should - get a /dev/ttyUSBx device(s). - - Now this channel of kernel messages is ready to be used: start - your favorite terminal emulator (minicom, etc.) and set - it up to use /dev/ttyUSB0 - or use a raw 'cat /dev/ttyUSBx' to - see the raw output. - - c.) On Nvidia Southbridge based systems: the kernel will try to probe - and find out which port has a debug device connected. - -3. Testing that it works fine: - - You can test the output by using earlyprintk=dbgp,keep and provoking - kernel messages on the host/target system. You can provoke a harmless - kernel message by for example doing: - - echo h > /proc/sysrq-trigger - - On the host/target system you should see this help line in "dmesg" output: - - SysRq : HELP : loglevel(0-9) reBoot Crashdump terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z) - - On the client/console system do: - - cat /dev/ttyUSB0 - - And you should see the help line above displayed shortly after you've - provoked it on the host system. - -If it does not work then please ask about it on the linux-kernel@vger.kernel.org -mailing list or contact the x86 maintainers. diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index 0e3e73458738..d9ccc0f39279 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -13,3 +13,4 @@ x86-specific Documentation exception-tables kernel-stacks entry_64 + earlyprintk -- cgit v1.2.3 From 0c2d3639a81b1033f3cb5d3c5ae4d9e7e1313cc4 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:22 +0800 Subject: Documentation: x86: convert zero-page.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/index.rst | 1 + Documentation/x86/zero-page.rst | 45 +++++++++++++++++++++++++++++++++++++++++ Documentation/x86/zero-page.txt | 40 ------------------------------------ 3 files changed, 46 insertions(+), 40 deletions(-) create mode 100644 Documentation/x86/zero-page.rst delete mode 100644 Documentation/x86/zero-page.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index d9ccc0f39279..e43aa9b31976 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -14,3 +14,4 @@ x86-specific Documentation kernel-stacks entry_64 earlyprintk + zero-page diff --git a/Documentation/x86/zero-page.rst b/Documentation/x86/zero-page.rst new file mode 100644 index 000000000000..f088f5881666 --- /dev/null +++ b/Documentation/x86/zero-page.rst @@ -0,0 +1,45 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========= +Zero Page +========= +The additional fields in struct boot_params as a part of 32-bit boot +protocol of kernel. These should be filled by bootloader or 16-bit +real-mode setup code of the kernel. References/settings to it mainly +are in:: + + arch/x86/include/uapi/asm/bootparam.h + +=========== ===== ======================= ================================================= +Offset/Size Proto Name Meaning + +000/040 ALL screen_info Text mode or frame buffer information + (struct screen_info) +040/014 ALL apm_bios_info APM BIOS information (struct apm_bios_info) +058/008 ALL tboot_addr Physical address of tboot shared page +060/010 ALL ist_info Intel SpeedStep (IST) BIOS support information + (struct ist_info) +080/010 ALL hd0_info hd0 disk parameter, OBSOLETE!! +090/010 ALL hd1_info hd1 disk parameter, OBSOLETE!! +0A0/010 ALL sys_desc_table System description table (struct sys_desc_table), + OBSOLETE!! +0B0/010 ALL olpc_ofw_header OLPC's OpenFirmware CIF and friends +0C0/004 ALL ext_ramdisk_image ramdisk_image high 32bits +0C4/004 ALL ext_ramdisk_size ramdisk_size high 32bits +0C8/004 ALL ext_cmd_line_ptr cmd_line_ptr high 32bits +140/080 ALL edid_info Video mode setup (struct edid_info) +1C0/020 ALL efi_info EFI 32 information (struct efi_info) +1E0/004 ALL alt_mem_k Alternative mem check, in KB +1E4/004 ALL scratch Scratch field for the kernel setup code +1E8/001 ALL e820_entries Number of entries in e820_table (below) +1E9/001 ALL eddbuf_entries Number of entries in eddbuf (below) +1EA/001 ALL edd_mbr_sig_buf_entries Number of entries in edd_mbr_sig_buffer + (below) +1EB/001 ALL kbd_status Numlock is enabled +1EC/001 ALL secure_boot Secure boot is enabled in the firmware +1EF/001 ALL sentinel Used to detect broken bootloaders +290/040 ALL edd_mbr_sig_buffer EDD MBR signatures +2D0/A00 ALL e820_table E820 memory map table + (array of struct e820_entry) +D00/1EC ALL eddbuf EDD data (array of struct edd_info) +=========== ===== ======================= ================================================= diff --git a/Documentation/x86/zero-page.txt b/Documentation/x86/zero-page.txt deleted file mode 100644 index 68aed077f7b6..000000000000 --- a/Documentation/x86/zero-page.txt +++ /dev/null @@ -1,40 +0,0 @@ -The additional fields in struct boot_params as a part of 32-bit boot -protocol of kernel. These should be filled by bootloader or 16-bit -real-mode setup code of the kernel. References/settings to it mainly -are in: - - arch/x86/include/uapi/asm/bootparam.h - - -Offset Proto Name Meaning -/Size - -000/040 ALL screen_info Text mode or frame buffer information - (struct screen_info) -040/014 ALL apm_bios_info APM BIOS information (struct apm_bios_info) -058/008 ALL tboot_addr Physical address of tboot shared page -060/010 ALL ist_info Intel SpeedStep (IST) BIOS support information - (struct ist_info) -080/010 ALL hd0_info hd0 disk parameter, OBSOLETE!! -090/010 ALL hd1_info hd1 disk parameter, OBSOLETE!! -0A0/010 ALL sys_desc_table System description table (struct sys_desc_table), - OBSOLETE!! -0B0/010 ALL olpc_ofw_header OLPC's OpenFirmware CIF and friends -0C0/004 ALL ext_ramdisk_image ramdisk_image high 32bits -0C4/004 ALL ext_ramdisk_size ramdisk_size high 32bits -0C8/004 ALL ext_cmd_line_ptr cmd_line_ptr high 32bits -140/080 ALL edid_info Video mode setup (struct edid_info) -1C0/020 ALL efi_info EFI 32 information (struct efi_info) -1E0/004 ALL alt_mem_k Alternative mem check, in KB -1E4/004 ALL scratch Scratch field for the kernel setup code -1E8/001 ALL e820_entries Number of entries in e820_table (below) -1E9/001 ALL eddbuf_entries Number of entries in eddbuf (below) -1EA/001 ALL edd_mbr_sig_buf_entries Number of entries in edd_mbr_sig_buffer - (below) -1EB/001 ALL kbd_status Numlock is enabled -1EC/001 ALL secure_boot Secure boot is enabled in the firmware -1EF/001 ALL sentinel Used to detect broken bootloaders -290/040 ALL edd_mbr_sig_buffer EDD MBR signatures -2D0/A00 ALL e820_table E820 memory map table - (array of struct e820_entry) -D00/1EC ALL eddbuf EDD data (array of struct edd_info) -- cgit v1.2.3 From 17156044b11c878a9fdd8326cf47bc0cbd1aa918 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:23 +0800 Subject: Documentation: x86: convert tlb.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/index.rst | 1 + Documentation/x86/tlb.rst | 83 +++++++++++++++++++++++++++++++++++++++++++++ Documentation/x86/tlb.txt | 75 ---------------------------------------- 3 files changed, 84 insertions(+), 75 deletions(-) create mode 100644 Documentation/x86/tlb.rst delete mode 100644 Documentation/x86/tlb.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index e43aa9b31976..c4ea25350221 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -15,3 +15,4 @@ x86-specific Documentation entry_64 earlyprintk zero-page + tlb diff --git a/Documentation/x86/tlb.rst b/Documentation/x86/tlb.rst new file mode 100644 index 000000000000..82ec58ae63a8 --- /dev/null +++ b/Documentation/x86/tlb.rst @@ -0,0 +1,83 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======= +The TLB +======= + +When the kernel unmaps or modified the attributes of a range of +memory, it has two choices: + + 1. Flush the entire TLB with a two-instruction sequence. This is + a quick operation, but it causes collateral damage: TLB entries + from areas other than the one we are trying to flush will be + destroyed and must be refilled later, at some cost. + 2. Use the invlpg instruction to invalidate a single page at a + time. This could potentially cost many more instructions, but + it is a much more precise operation, causing no collateral + damage to other TLB entries. + +Which method to do depends on a few things: + + 1. The size of the flush being performed. A flush of the entire + address space is obviously better performed by flushing the + entire TLB than doing 2^48/PAGE_SIZE individual flushes. + 2. The contents of the TLB. If the TLB is empty, then there will + be no collateral damage caused by doing the global flush, and + all of the individual flush will have ended up being wasted + work. + 3. The size of the TLB. The larger the TLB, the more collateral + damage we do with a full flush. So, the larger the TLB, the + more attractive an individual flush looks. Data and + instructions have separate TLBs, as do different page sizes. + 4. The microarchitecture. The TLB has become a multi-level + cache on modern CPUs, and the global flushes have become more + expensive relative to single-page flushes. + +There is obviously no way the kernel can know all these things, +especially the contents of the TLB during a given flush. The +sizes of the flush will vary greatly depending on the workload as +well. There is essentially no "right" point to choose. + +You may be doing too many individual invalidations if you see the +invlpg instruction (or instructions _near_ it) show up high in +profiles. If you believe that individual invalidations being +called too often, you can lower the tunable:: + + /sys/kernel/debug/x86/tlb_single_page_flush_ceiling + +This will cause us to do the global flush for more cases. +Lowering it to 0 will disable the use of the individual flushes. +Setting it to 1 is a very conservative setting and it should +never need to be 0 under normal circumstances. + +Despite the fact that a single individual flush on x86 is +guaranteed to flush a full 2MB [1]_, hugetlbfs always uses the full +flushes. THP is treated exactly the same as normal memory. + +You might see invlpg inside of flush_tlb_mm_range() show up in +profiles, or you can use the trace_tlb_flush() tracepoints. to +determine how long the flush operations are taking. + +Essentially, you are balancing the cycles you spend doing invlpg +with the cycles that you spend refilling the TLB later. + +You can measure how expensive TLB refills are by using +performance counters and 'perf stat', like this:: + + perf stat -e + cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/, + cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/, + cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/, + cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/, + cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/, + cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/ + +That works on an IvyBridge-era CPU (i5-3320M). Different CPUs +may have differently-named counters, but they should at least +be there in some form. You can use pmu-tools 'ocperf list' +(https://github.com/andikleen/pmu-tools) to find the right +counters for a given CPU. + +.. [1] A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation" + says: "One execution of INVLPG is sufficient even for a page + with size greater than 4 KBytes." diff --git a/Documentation/x86/tlb.txt b/Documentation/x86/tlb.txt deleted file mode 100644 index 6a0607b99ed8..000000000000 --- a/Documentation/x86/tlb.txt +++ /dev/null @@ -1,75 +0,0 @@ -When the kernel unmaps or modified the attributes of a range of -memory, it has two choices: - 1. Flush the entire TLB with a two-instruction sequence. This is - a quick operation, but it causes collateral damage: TLB entries - from areas other than the one we are trying to flush will be - destroyed and must be refilled later, at some cost. - 2. Use the invlpg instruction to invalidate a single page at a - time. This could potentially cost many more instructions, but - it is a much more precise operation, causing no collateral - damage to other TLB entries. - -Which method to do depends on a few things: - 1. The size of the flush being performed. A flush of the entire - address space is obviously better performed by flushing the - entire TLB than doing 2^48/PAGE_SIZE individual flushes. - 2. The contents of the TLB. If the TLB is empty, then there will - be no collateral damage caused by doing the global flush, and - all of the individual flush will have ended up being wasted - work. - 3. The size of the TLB. The larger the TLB, the more collateral - damage we do with a full flush. So, the larger the TLB, the - more attractive an individual flush looks. Data and - instructions have separate TLBs, as do different page sizes. - 4. The microarchitecture. The TLB has become a multi-level - cache on modern CPUs, and the global flushes have become more - expensive relative to single-page flushes. - -There is obviously no way the kernel can know all these things, -especially the contents of the TLB during a given flush. The -sizes of the flush will vary greatly depending on the workload as -well. There is essentially no "right" point to choose. - -You may be doing too many individual invalidations if you see the -invlpg instruction (or instructions _near_ it) show up high in -profiles. If you believe that individual invalidations being -called too often, you can lower the tunable: - - /sys/kernel/debug/x86/tlb_single_page_flush_ceiling - -This will cause us to do the global flush for more cases. -Lowering it to 0 will disable the use of the individual flushes. -Setting it to 1 is a very conservative setting and it should -never need to be 0 under normal circumstances. - -Despite the fact that a single individual flush on x86 is -guaranteed to flush a full 2MB [1], hugetlbfs always uses the full -flushes. THP is treated exactly the same as normal memory. - -You might see invlpg inside of flush_tlb_mm_range() show up in -profiles, or you can use the trace_tlb_flush() tracepoints. to -determine how long the flush operations are taking. - -Essentially, you are balancing the cycles you spend doing invlpg -with the cycles that you spend refilling the TLB later. - -You can measure how expensive TLB refills are by using -performance counters and 'perf stat', like this: - -perf stat -e - cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/, - cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/, - cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/, - cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/, - cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/, - cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/ - -That works on an IvyBridge-era CPU (i5-3320M). Different CPUs -may have differently-named counters, but they should at least -be there in some form. You can use pmu-tools 'ocperf list' -(https://github.com/andikleen/pmu-tools) to find the right -counters for a given CPU. - -1. A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation" - says: "One execution of INVLPG is sufficient even for a page - with size greater than 4 KBytes." -- cgit v1.2.3 From 26d14a2025f4e0d0aa3c157e1421e27fcc2d2bb3 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:24 +0800 Subject: Documentation: x86: convert mtrr.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Signed-off-by: Jonathan Corbet --- Documentation/x86/index.rst | 1 + Documentation/x86/mtrr.rst | 354 ++++++++++++++++++++++++++++++++++++++++++++ Documentation/x86/mtrr.txt | 329 ---------------------------------------- 3 files changed, 355 insertions(+), 329 deletions(-) create mode 100644 Documentation/x86/mtrr.rst delete mode 100644 Documentation/x86/mtrr.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index c4ea25350221..769c4491e9cb 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -16,3 +16,4 @@ x86-specific Documentation earlyprintk zero-page tlb + mtrr diff --git a/Documentation/x86/mtrr.rst b/Documentation/x86/mtrr.rst new file mode 100644 index 000000000000..c5b695d75349 --- /dev/null +++ b/Documentation/x86/mtrr.rst @@ -0,0 +1,354 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================= +MTRR (Memory Type Range Register) control +========================================= + +:Authors: - Richard Gooch - 3 Jun 1999 + - Luis R. Rodriguez - April 9, 2015 + + +Phasing out MTRR use +==================== + +MTRR use is replaced on modern x86 hardware with PAT. Direct MTRR use by +drivers on Linux is now completely phased out, device drivers should use +arch_phys_wc_add() in combination with ioremap_wc() to make MTRR effective on +non-PAT systems while a no-op but equally effective on PAT enabled systems. + +Even if Linux does not use MTRRs directly, some x86 platform firmware may still +set up MTRRs early before booting the OS. They do this as some platform +firmware may still have implemented access to MTRRs which would be controlled +and handled by the platform firmware directly. An example of platform use of +MTRRs is through the use of SMI handlers, one case could be for fan control, +the platform code would need uncachable access to some of its fan control +registers. Such platform access does not need any Operating System MTRR code in +place other than mtrr_type_lookup() to ensure any OS specific mapping requests +are aligned with platform MTRR setup. If MTRRs are only set up by the platform +firmware code though and the OS does not make any specific MTRR mapping +requests mtrr_type_lookup() should always return MTRR_TYPE_INVALID. + +For details refer to :doc:`pat`. + +.. tip:: + On Intel P6 family processors (Pentium Pro, Pentium II and later) + the Memory Type Range Registers (MTRRs) may be used to control + processor access to memory ranges. This is most useful when you have + a video (VGA) card on a PCI or AGP bus. Enabling write-combining + allows bus write transfers to be combined into a larger transfer + before bursting over the PCI/AGP bus. This can increase performance + of image write operations 2.5 times or more. + + The Cyrix 6x86, 6x86MX and M II processors have Address Range + Registers (ARRs) which provide a similar functionality to MTRRs. For + these, the ARRs are used to emulate the MTRRs. + + The AMD K6-2 (stepping 8 and above) and K6-3 processors have two + MTRRs. These are supported. The AMD Athlon family provide 8 Intel + style MTRRs. + + The Centaur C6 (WinChip) has 8 MCRs, allowing write-combining. These + are supported. + + The VIA Cyrix III and VIA C3 CPUs offer 8 Intel style MTRRs. + + The CONFIG_MTRR option creates a /proc/mtrr file which may be used + to manipulate your MTRRs. Typically the X server should use + this. This should have a reasonably generic interface so that + similar control registers on other processors can be easily + supported. + +There are two interfaces to /proc/mtrr: one is an ASCII interface +which allows you to read and write. The other is an ioctl() +interface. The ASCII interface is meant for administration. The +ioctl() interface is meant for C programs (i.e. the X server). The +interfaces are described below, with sample commands and C code. + + +Reading MTRRs from the shell +============================ +:: + + % cat /proc/mtrr + reg00: base=0x00000000 ( 0MB), size= 128MB: write-back, count=1 + reg01: base=0x08000000 ( 128MB), size= 64MB: write-back, count=1 + +Creating MTRRs from the C-shell:: + + # echo "base=0xf8000000 size=0x400000 type=write-combining" >! /proc/mtrr + +or if you use bash:: + + # echo "base=0xf8000000 size=0x400000 type=write-combining" >| /proc/mtrr + +And the result thereof:: + + % cat /proc/mtrr + reg00: base=0x00000000 ( 0MB), size= 128MB: write-back, count=1 + reg01: base=0x08000000 ( 128MB), size= 64MB: write-back, count=1 + reg02: base=0xf8000000 (3968MB), size= 4MB: write-combining, count=1 + +This is for video RAM at base address 0xf8000000 and size 4 megabytes. To +find out your base address, you need to look at the output of your X +server, which tells you where the linear framebuffer address is. A +typical line that you may get is:: + + (--) S3: PCI: 968 rev 0, Linear FB @ 0xf8000000 + +Note that you should only use the value from the X server, as it may +move the framebuffer base address, so the only value you can trust is +that reported by the X server. + +To find out the size of your framebuffer (what, you don't actually +know?), the following line will tell you:: + + (--) S3: videoram: 4096k + +That's 4 megabytes, which is 0x400000 bytes (in hexadecimal). +A patch is being written for XFree86 which will make this automatic: +in other words the X server will manipulate /proc/mtrr using the +ioctl() interface, so users won't have to do anything. If you use a +commercial X server, lobby your vendor to add support for MTRRs. + + +Creating overlapping MTRRs +========================== +:: + + %echo "base=0xfb000000 size=0x1000000 type=write-combining" >/proc/mtrr + %echo "base=0xfb000000 size=0x1000 type=uncachable" >/proc/mtrr + +And the results:: + + % cat /proc/mtrr + reg00: base=0x00000000 ( 0MB), size= 64MB: write-back, count=1 + reg01: base=0xfb000000 (4016MB), size= 16MB: write-combining, count=1 + reg02: base=0xfb000000 (4016MB), size= 4kB: uncachable, count=1 + +Some cards (especially Voodoo Graphics boards) need this 4 kB area +excluded from the beginning of the region because it is used for +registers. + +NOTE: You can only create type=uncachable region, if the first +region that you created is type=write-combining. + + +Removing MTRRs from the C-shel +============================== +:: + + % echo "disable=2" >! /proc/mtrr + +or using bash:: + + % echo "disable=2" >| /proc/mtrr + + +Reading MTRRs from a C program using ioctl()'s +============================================== +:: + + /* mtrr-show.c + + Source file for mtrr-show (example program to show MTRRs using ioctl()'s) + + Copyright (C) 1997-1998 Richard Gooch + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + + Richard Gooch may be reached by email at rgooch@atnf.csiro.au + The postal address is: + Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia. + */ + + /* + This program will use an ioctl() on /proc/mtrr to show the current MTRR + settings. This is an alternative to reading /proc/mtrr. + + + Written by Richard Gooch 17-DEC-1997 + + Last updated by Richard Gooch 2-MAY-1998 + + + */ + #include + #include + #include + #include + #include + #include + #include + #include + #include + + #define TRUE 1 + #define FALSE 0 + #define ERRSTRING strerror (errno) + + static char *mtrr_strings[MTRR_NUM_TYPES] = + { + "uncachable", /* 0 */ + "write-combining", /* 1 */ + "?", /* 2 */ + "?", /* 3 */ + "write-through", /* 4 */ + "write-protect", /* 5 */ + "write-back", /* 6 */ + }; + + int main () + { + int fd; + struct mtrr_gentry gentry; + + if ( ( fd = open ("/proc/mtrr", O_RDONLY, 0) ) == -1 ) + { + if (errno == ENOENT) + { + fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n", + stderr); + exit (1); + } + fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING); + exit (2); + } + for (gentry.regnum = 0; ioctl (fd, MTRRIOC_GET_ENTRY, &gentry) == 0; + ++gentry.regnum) + { + if (gentry.size < 1) + { + fprintf (stderr, "Register: %u disabled\n", gentry.regnum); + continue; + } + fprintf (stderr, "Register: %u base: 0x%lx size: 0x%lx type: %s\n", + gentry.regnum, gentry.base, gentry.size, + mtrr_strings[gentry.type]); + } + if (errno == EINVAL) exit (0); + fprintf (stderr, "Error doing ioctl(2) on /dev/mtrr\t%s\n", ERRSTRING); + exit (3); + } /* End Function main */ + + +Creating MTRRs from a C programme using ioctl()'s +================================================= +:: + + /* mtrr-add.c + + Source file for mtrr-add (example programme to add an MTRRs using ioctl()) + + Copyright (C) 1997-1998 Richard Gooch + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + + Richard Gooch may be reached by email at rgooch@atnf.csiro.au + The postal address is: + Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia. + */ + + /* + This programme will use an ioctl() on /proc/mtrr to add an entry. The first + available mtrr is used. This is an alternative to writing /proc/mtrr. + + + Written by Richard Gooch 17-DEC-1997 + + Last updated by Richard Gooch 2-MAY-1998 + + + */ + #include + #include + #include + #include + #include + #include + #include + #include + #include + #include + + #define TRUE 1 + #define FALSE 0 + #define ERRSTRING strerror (errno) + + static char *mtrr_strings[MTRR_NUM_TYPES] = + { + "uncachable", /* 0 */ + "write-combining", /* 1 */ + "?", /* 2 */ + "?", /* 3 */ + "write-through", /* 4 */ + "write-protect", /* 5 */ + "write-back", /* 6 */ + }; + + int main (int argc, char **argv) + { + int fd; + struct mtrr_sentry sentry; + + if (argc != 4) + { + fprintf (stderr, "Usage:\tmtrr-add base size type\n"); + exit (1); + } + sentry.base = strtoul (argv[1], NULL, 0); + sentry.size = strtoul (argv[2], NULL, 0); + for (sentry.type = 0; sentry.type < MTRR_NUM_TYPES; ++sentry.type) + { + if (strcmp (argv[3], mtrr_strings[sentry.type]) == 0) break; + } + if (sentry.type >= MTRR_NUM_TYPES) + { + fprintf (stderr, "Illegal type: \"%s\"\n", argv[3]); + exit (2); + } + if ( ( fd = open ("/proc/mtrr", O_WRONLY, 0) ) == -1 ) + { + if (errno == ENOENT) + { + fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n", + stderr); + exit (3); + } + fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING); + exit (4); + } + if (ioctl (fd, MTRRIOC_ADD_ENTRY, &sentry) == -1) + { + fprintf (stderr, "Error doing ioctl(2) on /dev/mtrr\t%s\n", ERRSTRING); + exit (5); + } + fprintf (stderr, "Sleeping for 5 seconds so you can see the new entry\n"); + sleep (5); + close (fd); + fputs ("I've just closed /proc/mtrr so now the new entry should be gone\n", + stderr); + } /* End Function main */ diff --git a/Documentation/x86/mtrr.txt b/Documentation/x86/mtrr.txt deleted file mode 100644 index dc3e703913ac..000000000000 --- a/Documentation/x86/mtrr.txt +++ /dev/null @@ -1,329 +0,0 @@ -MTRR (Memory Type Range Register) control - -Richard Gooch - 3 Jun 1999 -Luis R. Rodriguez - April 9, 2015 - -=============================================================================== -Phasing out MTRR use - -MTRR use is replaced on modern x86 hardware with PAT. Direct MTRR use by -drivers on Linux is now completely phased out, device drivers should use -arch_phys_wc_add() in combination with ioremap_wc() to make MTRR effective on -non-PAT systems while a no-op but equally effective on PAT enabled systems. - -Even if Linux does not use MTRRs directly, some x86 platform firmware may still -set up MTRRs early before booting the OS. They do this as some platform -firmware may still have implemented access to MTRRs which would be controlled -and handled by the platform firmware directly. An example of platform use of -MTRRs is through the use of SMI handlers, one case could be for fan control, -the platform code would need uncachable access to some of its fan control -registers. Such platform access does not need any Operating System MTRR code in -place other than mtrr_type_lookup() to ensure any OS specific mapping requests -are aligned with platform MTRR setup. If MTRRs are only set up by the platform -firmware code though and the OS does not make any specific MTRR mapping -requests mtrr_type_lookup() should always return MTRR_TYPE_INVALID. - -For details refer to Documentation/x86/pat.txt. - -=============================================================================== - - On Intel P6 family processors (Pentium Pro, Pentium II and later) - the Memory Type Range Registers (MTRRs) may be used to control - processor access to memory ranges. This is most useful when you have - a video (VGA) card on a PCI or AGP bus. Enabling write-combining - allows bus write transfers to be combined into a larger transfer - before bursting over the PCI/AGP bus. This can increase performance - of image write operations 2.5 times or more. - - The Cyrix 6x86, 6x86MX and M II processors have Address Range - Registers (ARRs) which provide a similar functionality to MTRRs. For - these, the ARRs are used to emulate the MTRRs. - - The AMD K6-2 (stepping 8 and above) and K6-3 processors have two - MTRRs. These are supported. The AMD Athlon family provide 8 Intel - style MTRRs. - - The Centaur C6 (WinChip) has 8 MCRs, allowing write-combining. These - are supported. - - The VIA Cyrix III and VIA C3 CPUs offer 8 Intel style MTRRs. - - The CONFIG_MTRR option creates a /proc/mtrr file which may be used - to manipulate your MTRRs. Typically the X server should use - this. This should have a reasonably generic interface so that - similar control registers on other processors can be easily - supported. - - -There are two interfaces to /proc/mtrr: one is an ASCII interface -which allows you to read and write. The other is an ioctl() -interface. The ASCII interface is meant for administration. The -ioctl() interface is meant for C programs (i.e. the X server). The -interfaces are described below, with sample commands and C code. - -=============================================================================== -Reading MTRRs from the shell: - -% cat /proc/mtrr -reg00: base=0x00000000 ( 0MB), size= 128MB: write-back, count=1 -reg01: base=0x08000000 ( 128MB), size= 64MB: write-back, count=1 -=============================================================================== -Creating MTRRs from the C-shell: -# echo "base=0xf8000000 size=0x400000 type=write-combining" >! /proc/mtrr -or if you use bash: -# echo "base=0xf8000000 size=0x400000 type=write-combining" >| /proc/mtrr - -And the result thereof: -% cat /proc/mtrr -reg00: base=0x00000000 ( 0MB), size= 128MB: write-back, count=1 -reg01: base=0x08000000 ( 128MB), size= 64MB: write-back, count=1 -reg02: base=0xf8000000 (3968MB), size= 4MB: write-combining, count=1 - -This is for video RAM at base address 0xf8000000 and size 4 megabytes. To -find out your base address, you need to look at the output of your X -server, which tells you where the linear framebuffer address is. A -typical line that you may get is: - -(--) S3: PCI: 968 rev 0, Linear FB @ 0xf8000000 - -Note that you should only use the value from the X server, as it may -move the framebuffer base address, so the only value you can trust is -that reported by the X server. - -To find out the size of your framebuffer (what, you don't actually -know?), the following line will tell you: - -(--) S3: videoram: 4096k - -That's 4 megabytes, which is 0x400000 bytes (in hexadecimal). -A patch is being written for XFree86 which will make this automatic: -in other words the X server will manipulate /proc/mtrr using the -ioctl() interface, so users won't have to do anything. If you use a -commercial X server, lobby your vendor to add support for MTRRs. -=============================================================================== -Creating overlapping MTRRs: - -%echo "base=0xfb000000 size=0x1000000 type=write-combining" >/proc/mtrr -%echo "base=0xfb000000 size=0x1000 type=uncachable" >/proc/mtrr - -And the results: cat /proc/mtrr -reg00: base=0x00000000 ( 0MB), size= 64MB: write-back, count=1 -reg01: base=0xfb000000 (4016MB), size= 16MB: write-combining, count=1 -reg02: base=0xfb000000 (4016MB), size= 4kB: uncachable, count=1 - -Some cards (especially Voodoo Graphics boards) need this 4 kB area -excluded from the beginning of the region because it is used for -registers. - -NOTE: You can only create type=uncachable region, if the first -region that you created is type=write-combining. -=============================================================================== -Removing MTRRs from the C-shell: -% echo "disable=2" >! /proc/mtrr -or using bash: -% echo "disable=2" >| /proc/mtrr -=============================================================================== -Reading MTRRs from a C program using ioctl()'s: - -/* mtrr-show.c - - Source file for mtrr-show (example program to show MTRRs using ioctl()'s) - - Copyright (C) 1997-1998 Richard Gooch - - This program is free software; you can redistribute it and/or modify - it under the terms of the GNU General Public License as published by - the Free Software Foundation; either version 2 of the License, or - (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. - - Richard Gooch may be reached by email at rgooch@atnf.csiro.au - The postal address is: - Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia. -*/ - -/* - This program will use an ioctl() on /proc/mtrr to show the current MTRR - settings. This is an alternative to reading /proc/mtrr. - - - Written by Richard Gooch 17-DEC-1997 - - Last updated by Richard Gooch 2-MAY-1998 - - -*/ -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#define TRUE 1 -#define FALSE 0 -#define ERRSTRING strerror (errno) - -static char *mtrr_strings[MTRR_NUM_TYPES] = -{ - "uncachable", /* 0 */ - "write-combining", /* 1 */ - "?", /* 2 */ - "?", /* 3 */ - "write-through", /* 4 */ - "write-protect", /* 5 */ - "write-back", /* 6 */ -}; - -int main () -{ - int fd; - struct mtrr_gentry gentry; - - if ( ( fd = open ("/proc/mtrr", O_RDONLY, 0) ) == -1 ) - { - if (errno == ENOENT) - { - fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n", - stderr); - exit (1); - } - fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING); - exit (2); - } - for (gentry.regnum = 0; ioctl (fd, MTRRIOC_GET_ENTRY, &gentry) == 0; - ++gentry.regnum) - { - if (gentry.size < 1) - { - fprintf (stderr, "Register: %u disabled\n", gentry.regnum); - continue; - } - fprintf (stderr, "Register: %u base: 0x%lx size: 0x%lx type: %s\n", - gentry.regnum, gentry.base, gentry.size, - mtrr_strings[gentry.type]); - } - if (errno == EINVAL) exit (0); - fprintf (stderr, "Error doing ioctl(2) on /dev/mtrr\t%s\n", ERRSTRING); - exit (3); -} /* End Function main */ -=============================================================================== -Creating MTRRs from a C programme using ioctl()'s: - -/* mtrr-add.c - - Source file for mtrr-add (example programme to add an MTRRs using ioctl()) - - Copyright (C) 1997-1998 Richard Gooch - - This program is free software; you can redistribute it and/or modify - it under the terms of the GNU General Public License as published by - the Free Software Foundation; either version 2 of the License, or - (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. - - Richard Gooch may be reached by email at rgooch@atnf.csiro.au - The postal address is: - Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia. -*/ - -/* - This programme will use an ioctl() on /proc/mtrr to add an entry. The first - available mtrr is used. This is an alternative to writing /proc/mtrr. - - - Written by Richard Gooch 17-DEC-1997 - - Last updated by Richard Gooch 2-MAY-1998 - - -*/ -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#define TRUE 1 -#define FALSE 0 -#define ERRSTRING strerror (errno) - -static char *mtrr_strings[MTRR_NUM_TYPES] = -{ - "uncachable", /* 0 */ - "write-combining", /* 1 */ - "?", /* 2 */ - "?", /* 3 */ - "write-through", /* 4 */ - "write-protect", /* 5 */ - "write-back", /* 6 */ -}; - -int main (int argc, char **argv) -{ - int fd; - struct mtrr_sentry sentry; - - if (argc != 4) - { - fprintf (stderr, "Usage:\tmtrr-add base size type\n"); - exit (1); - } - sentry.base = strtoul (argv[1], NULL, 0); - sentry.size = strtoul (argv[2], NULL, 0); - for (sentry.type = 0; sentry.type < MTRR_NUM_TYPES; ++sentry.type) - { - if (strcmp (argv[3], mtrr_strings[sentry.type]) == 0) break; - } - if (sentry.type >= MTRR_NUM_TYPES) - { - fprintf (stderr, "Illegal type: \"%s\"\n", argv[3]); - exit (2); - } - if ( ( fd = open ("/proc/mtrr", O_WRONLY, 0) ) == -1 ) - { - if (errno == ENOENT) - { - fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n", - stderr); - exit (3); - } - fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING); - exit (4); - } - if (ioctl (fd, MTRRIOC_ADD_ENTRY, &sentry) == -1) - { - fprintf (stderr, "Error doing ioctl(2) on /dev/mtrr\t%s\n", ERRSTRING); - exit (5); - } - fprintf (stderr, "Sleeping for 5 seconds so you can see the new entry\n"); - sleep (5); - close (fd); - fputs ("I've just closed /proc/mtrr so now the new entry should be gone\n", - stderr); -} /* End Function main */ -=============================================================================== -- cgit v1.2.3 From 2f6eae4730120fd6459f55e47b750cd3570e9349 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:25 +0800 Subject: Documentation: x86: convert pat.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Cc: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/index.rst | 1 + Documentation/x86/pat.rst | 242 ++++++++++++++++++++++++++++++++++++++++++++ Documentation/x86/pat.txt | 230 ----------------------------------------- 3 files changed, 243 insertions(+), 230 deletions(-) create mode 100644 Documentation/x86/pat.rst delete mode 100644 Documentation/x86/pat.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index 769c4491e9cb..f7012e4afacd 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -17,3 +17,4 @@ x86-specific Documentation zero-page tlb mtrr + pat diff --git a/Documentation/x86/pat.rst b/Documentation/x86/pat.rst new file mode 100644 index 000000000000..9a298fd97d74 --- /dev/null +++ b/Documentation/x86/pat.rst @@ -0,0 +1,242 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== +PAT (Page Attribute Table) +========================== + +x86 Page Attribute Table (PAT) allows for setting the memory attribute at the +page level granularity. PAT is complementary to the MTRR settings which allows +for setting of memory types over physical address ranges. However, PAT is +more flexible than MTRR due to its capability to set attributes at page level +and also due to the fact that there are no hardware limitations on number of +such attribute settings allowed. Added flexibility comes with guidelines for +not having memory type aliasing for the same physical memory with multiple +virtual addresses. + +PAT allows for different types of memory attributes. The most commonly used +ones that will be supported at this time are: + +=== ============== +WB Write-back +UC Uncached +WC Write-combined +WT Write-through +UC- Uncached Minus +=== ============== + + +PAT APIs +======== + +There are many different APIs in the kernel that allows setting of memory +attributes at the page level. In order to avoid aliasing, these interfaces +should be used thoughtfully. Below is a table of interfaces available, +their intended usage and their memory attribute relationships. Internally, +these APIs use a reserve_memtype()/free_memtype() interface on the physical +address range to avoid any aliasing. + ++------------------------+----------+--------------+------------------+ +| API | RAM | ACPI,... | Reserved/Holes | ++------------------------+----------+--------------+------------------+ +| ioremap | -- | UC- | UC- | ++------------------------+----------+--------------+------------------+ +| ioremap_cache | -- | WB | WB | ++------------------------+----------+--------------+------------------+ +| ioremap_uc | -- | UC | UC | ++------------------------+----------+--------------+------------------+ +| ioremap_nocache | -- | UC- | UC- | ++------------------------+----------+--------------+------------------+ +| ioremap_wc | -- | -- | WC | ++------------------------+----------+--------------+------------------+ +| ioremap_wt | -- | -- | WT | ++------------------------+----------+--------------+------------------+ +| set_memory_uc, | UC- | -- | -- | +| set_memory_wb | | | | ++------------------------+----------+--------------+------------------+ +| set_memory_wc, | WC | -- | -- | +| set_memory_wb | | | | ++------------------------+----------+--------------+------------------+ +| set_memory_wt, | WT | -- | -- | +| set_memory_wb | | | | ++------------------------+----------+--------------+------------------+ +| pci sysfs resource | -- | -- | UC- | ++------------------------+----------+--------------+------------------+ +| pci sysfs resource_wc | -- | -- | WC | +| is IORESOURCE_PREFETCH | | | | ++------------------------+----------+--------------+------------------+ +| pci proc | -- | -- | UC- | +| !PCIIOC_WRITE_COMBINE | | | | ++------------------------+----------+--------------+------------------+ +| pci proc | -- | -- | WC | +| PCIIOC_WRITE_COMBINE | | | | ++------------------------+----------+--------------+------------------+ +| /dev/mem | -- | WB/WC/UC- | WB/WC/UC- | +| read-write | | | | ++------------------------+----------+--------------+------------------+ +| /dev/mem | -- | UC- | UC- | +| mmap SYNC flag | | | | ++------------------------+----------+--------------+------------------+ +| /dev/mem | -- | WB/WC/UC- | WB/WC/UC- | +| mmap !SYNC flag | | | | +| and | |(from existing| (from existing | +| any alias to this area | |alias) | alias) | ++------------------------+----------+--------------+------------------+ +| /dev/mem | -- | WB | WB | +| mmap !SYNC flag | | | | +| no alias to this area | | | | +| and | | | | +| MTRR says WB | | | | ++------------------------+----------+--------------+------------------+ +| /dev/mem | -- | -- | UC- | +| mmap !SYNC flag | | | | +| no alias to this area | | | | +| and | | | | +| MTRR says !WB | | | | ++------------------------+----------+--------------+------------------+ + + +Advanced APIs for drivers +========================= + +A. Exporting pages to users with remap_pfn_range, io_remap_pfn_range, +vmf_insert_pfn. + +Drivers wanting to export some pages to userspace do it by using mmap +interface and a combination of: + + 1) pgprot_noncached() + 2) io_remap_pfn_range() or remap_pfn_range() or vmf_insert_pfn() + +With PAT support, a new API pgprot_writecombine is being added. So, drivers can +continue to use the above sequence, with either pgprot_noncached() or +pgprot_writecombine() in step 1, followed by step 2. + +In addition, step 2 internally tracks the region as UC or WC in memtype +list in order to ensure no conflicting mapping. + +Note that this set of APIs only works with IO (non RAM) regions. If driver +wants to export a RAM region, it has to do set_memory_uc() or set_memory_wc() +as step 0 above and also track the usage of those pages and use set_memory_wb() +before the page is freed to free pool. + +MTRR effects on PAT / non-PAT systems +===================================== + +The following table provides the effects of using write-combining MTRRs when +using ioremap*() calls on x86 for both non-PAT and PAT systems. Ideally +mtrr_add() usage will be phased out in favor of arch_phys_wc_add() which will +be a no-op on PAT enabled systems. The region over which a arch_phys_wc_add() +is made, should already have been ioremapped with WC attributes or PAT entries, +this can be done by using ioremap_wc() / set_memory_wc(). Devices which +combine areas of IO memory desired to remain uncacheable with areas where +write-combining is desirable should consider use of ioremap_uc() followed by +set_memory_wc() to white-list effective write-combined areas. Such use is +nevertheless discouraged as the effective memory type is considered +implementation defined, yet this strategy can be used as last resort on devices +with size-constrained regions where otherwise MTRR write-combining would +otherwise not be effective. +:: + + ==== ======= === ========================= ===================== + MTRR Non-PAT PAT Linux ioremap value Effective memory type + ==== ======= === ========================= ===================== + PAT Non-PAT | PAT + |PCD | + ||PWT | + ||| | + WC 000 WB _PAGE_CACHE_MODE_WB WC | WC + WC 001 WC _PAGE_CACHE_MODE_WC WC* | WC + WC 010 UC- _PAGE_CACHE_MODE_UC_MINUS WC* | UC + WC 011 UC _PAGE_CACHE_MODE_UC UC | UC + ==== ======= === ========================= ===================== + + (*) denotes implementation defined and is discouraged + +.. note:: -- in the above table mean "Not suggested usage for the API". Some + of the --'s are strictly enforced by the kernel. Some others are not really + enforced today, but may be enforced in future. + +For ioremap and pci access through /sys or /proc - The actual type returned +can be more restrictive, in case of any existing aliasing for that address. +For example: If there is an existing uncached mapping, a new ioremap_wc can +return uncached mapping in place of write-combine requested. + +set_memory_[uc|wc|wt] and set_memory_wb should be used in pairs, where driver +will first make a region uc, wc or wt and switch it back to wb after use. + +Over time writes to /proc/mtrr will be deprecated in favor of using PAT based +interfaces. Users writing to /proc/mtrr are suggested to use above interfaces. + +Drivers should use ioremap_[uc|wc] to access PCI BARs with [uc|wc] access +types. + +Drivers should use set_memory_[uc|wc|wt] to set access type for RAM ranges. + + +PAT debugging +============= + +With CONFIG_DEBUG_FS enabled, PAT memtype list can be examined by:: + + # mount -t debugfs debugfs /sys/kernel/debug + # cat /sys/kernel/debug/x86/pat_memtype_list + PAT memtype list: + uncached-minus @ 0x7fadf000-0x7fae0000 + uncached-minus @ 0x7fb19000-0x7fb1a000 + uncached-minus @ 0x7fb1a000-0x7fb1b000 + uncached-minus @ 0x7fb1b000-0x7fb1c000 + uncached-minus @ 0x7fb1c000-0x7fb1d000 + uncached-minus @ 0x7fb1d000-0x7fb1e000 + uncached-minus @ 0x7fb1e000-0x7fb25000 + uncached-minus @ 0x7fb25000-0x7fb26000 + uncached-minus @ 0x7fb26000-0x7fb27000 + uncached-minus @ 0x7fb27000-0x7fb28000 + uncached-minus @ 0x7fb28000-0x7fb2e000 + uncached-minus @ 0x7fb2e000-0x7fb2f000 + uncached-minus @ 0x7fb2f000-0x7fb30000 + uncached-minus @ 0x7fb31000-0x7fb32000 + uncached-minus @ 0x80000000-0x90000000 + +This list shows physical address ranges and various PAT settings used to +access those physical address ranges. + +Another, more verbose way of getting PAT related debug messages is with +"debugpat" boot parameter. With this parameter, various debug messages are +printed to dmesg log. + +PAT Initialization +================== + +The following table describes how PAT is initialized under various +configurations. The PAT MSR must be updated by Linux in order to support WC +and WT attributes. Otherwise, the PAT MSR has the value programmed in it +by the firmware. Note, Xen enables WC attribute in the PAT MSR for guests. + + ==== ===== ========================== ========= ======= + MTRR PAT Call Sequence PAT State PAT MSR + ==== ===== ========================== ========= ======= + E E MTRR -> PAT init Enabled OS + E D MTRR -> PAT init Disabled - + D E MTRR -> PAT disable Disabled BIOS + D D MTRR -> PAT disable Disabled - + - np/E PAT -> PAT disable Disabled BIOS + - np/D PAT -> PAT disable Disabled - + E !P/E MTRR -> PAT init Disabled BIOS + D !P/E MTRR -> PAT disable Disabled BIOS + !M !P/E MTRR stub -> PAT disable Disabled BIOS + ==== ===== ========================== ========= ======= + + Legend + + ========= ======================================= + E Feature enabled in CPU + D Feature disabled/unsupported in CPU + np "nopat" boot option specified + !P CONFIG_X86_PAT option unset + !M CONFIG_MTRR option unset + Enabled PAT state set to enabled + Disabled PAT state set to disabled + OS PAT initializes PAT MSR with OS setting + BIOS PAT keeps PAT MSR with BIOS setting + ========= ======================================= + diff --git a/Documentation/x86/pat.txt b/Documentation/x86/pat.txt deleted file mode 100644 index 481d8d8536ac..000000000000 --- a/Documentation/x86/pat.txt +++ /dev/null @@ -1,230 +0,0 @@ - -PAT (Page Attribute Table) - -x86 Page Attribute Table (PAT) allows for setting the memory attribute at the -page level granularity. PAT is complementary to the MTRR settings which allows -for setting of memory types over physical address ranges. However, PAT is -more flexible than MTRR due to its capability to set attributes at page level -and also due to the fact that there are no hardware limitations on number of -such attribute settings allowed. Added flexibility comes with guidelines for -not having memory type aliasing for the same physical memory with multiple -virtual addresses. - -PAT allows for different types of memory attributes. The most commonly used -ones that will be supported at this time are Write-back, Uncached, -Write-combined, Write-through and Uncached Minus. - - -PAT APIs --------- - -There are many different APIs in the kernel that allows setting of memory -attributes at the page level. In order to avoid aliasing, these interfaces -should be used thoughtfully. Below is a table of interfaces available, -their intended usage and their memory attribute relationships. Internally, -these APIs use a reserve_memtype()/free_memtype() interface on the physical -address range to avoid any aliasing. - - -------------------------------------------------------------------- -API | RAM | ACPI,... | Reserved/Holes | ------------------------|----------|------------|------------------| - | | | | -ioremap | -- | UC- | UC- | - | | | | -ioremap_cache | -- | WB | WB | - | | | | -ioremap_uc | -- | UC | UC | - | | | | -ioremap_nocache | -- | UC- | UC- | - | | | | -ioremap_wc | -- | -- | WC | - | | | | -ioremap_wt | -- | -- | WT | - | | | | -set_memory_uc | UC- | -- | -- | - set_memory_wb | | | | - | | | | -set_memory_wc | WC | -- | -- | - set_memory_wb | | | | - | | | | -set_memory_wt | WT | -- | -- | - set_memory_wb | | | | - | | | | -pci sysfs resource | -- | -- | UC- | - | | | | -pci sysfs resource_wc | -- | -- | WC | - is IORESOURCE_PREFETCH| | | | - | | | | -pci proc | -- | -- | UC- | - !PCIIOC_WRITE_COMBINE | | | | - | | | | -pci proc | -- | -- | WC | - PCIIOC_WRITE_COMBINE | | | | - | | | | -/dev/mem | -- | WB/WC/UC- | WB/WC/UC- | - read-write | | | | - | | | | -/dev/mem | -- | UC- | UC- | - mmap SYNC flag | | | | - | | | | -/dev/mem | -- | WB/WC/UC- | WB/WC/UC- | - mmap !SYNC flag | |(from exist-| (from exist- | - and | | ing alias)| ing alias) | - any alias to this area| | | | - | | | | -/dev/mem | -- | WB | WB | - mmap !SYNC flag | | | | - no alias to this area | | | | - and | | | | - MTRR says WB | | | | - | | | | -/dev/mem | -- | -- | UC- | - mmap !SYNC flag | | | | - no alias to this area | | | | - and | | | | - MTRR says !WB | | | | - | | | | -------------------------------------------------------------------- - -Advanced APIs for drivers -------------------------- -A. Exporting pages to users with remap_pfn_range, io_remap_pfn_range, -vmf_insert_pfn - -Drivers wanting to export some pages to userspace do it by using mmap -interface and a combination of -1) pgprot_noncached() -2) io_remap_pfn_range() or remap_pfn_range() or vmf_insert_pfn() - -With PAT support, a new API pgprot_writecombine is being added. So, drivers can -continue to use the above sequence, with either pgprot_noncached() or -pgprot_writecombine() in step 1, followed by step 2. - -In addition, step 2 internally tracks the region as UC or WC in memtype -list in order to ensure no conflicting mapping. - -Note that this set of APIs only works with IO (non RAM) regions. If driver -wants to export a RAM region, it has to do set_memory_uc() or set_memory_wc() -as step 0 above and also track the usage of those pages and use set_memory_wb() -before the page is freed to free pool. - -MTRR effects on PAT / non-PAT systems -------------------------------------- - -The following table provides the effects of using write-combining MTRRs when -using ioremap*() calls on x86 for both non-PAT and PAT systems. Ideally -mtrr_add() usage will be phased out in favor of arch_phys_wc_add() which will -be a no-op on PAT enabled systems. The region over which a arch_phys_wc_add() -is made, should already have been ioremapped with WC attributes or PAT entries, -this can be done by using ioremap_wc() / set_memory_wc(). Devices which -combine areas of IO memory desired to remain uncacheable with areas where -write-combining is desirable should consider use of ioremap_uc() followed by -set_memory_wc() to white-list effective write-combined areas. Such use is -nevertheless discouraged as the effective memory type is considered -implementation defined, yet this strategy can be used as last resort on devices -with size-constrained regions where otherwise MTRR write-combining would -otherwise not be effective. - ----------------------------------------------------------------------- -MTRR Non-PAT PAT Linux ioremap value Effective memory type ----------------------------------------------------------------------- - Non-PAT | PAT - PAT - |PCD - ||PWT - ||| -WC 000 WB _PAGE_CACHE_MODE_WB WC | WC -WC 001 WC _PAGE_CACHE_MODE_WC WC* | WC -WC 010 UC- _PAGE_CACHE_MODE_UC_MINUS WC* | UC -WC 011 UC _PAGE_CACHE_MODE_UC UC | UC ----------------------------------------------------------------------- - -(*) denotes implementation defined and is discouraged - -Notes: - --- in the above table mean "Not suggested usage for the API". Some of the --'s -are strictly enforced by the kernel. Some others are not really enforced -today, but may be enforced in future. - -For ioremap and pci access through /sys or /proc - The actual type returned -can be more restrictive, in case of any existing aliasing for that address. -For example: If there is an existing uncached mapping, a new ioremap_wc can -return uncached mapping in place of write-combine requested. - -set_memory_[uc|wc|wt] and set_memory_wb should be used in pairs, where driver -will first make a region uc, wc or wt and switch it back to wb after use. - -Over time writes to /proc/mtrr will be deprecated in favor of using PAT based -interfaces. Users writing to /proc/mtrr are suggested to use above interfaces. - -Drivers should use ioremap_[uc|wc] to access PCI BARs with [uc|wc] access -types. - -Drivers should use set_memory_[uc|wc|wt] to set access type for RAM ranges. - - -PAT debugging -------------- - -With CONFIG_DEBUG_FS enabled, PAT memtype list can be examined by - -# mount -t debugfs debugfs /sys/kernel/debug -# cat /sys/kernel/debug/x86/pat_memtype_list -PAT memtype list: -uncached-minus @ 0x7fadf000-0x7fae0000 -uncached-minus @ 0x7fb19000-0x7fb1a000 -uncached-minus @ 0x7fb1a000-0x7fb1b000 -uncached-minus @ 0x7fb1b000-0x7fb1c000 -uncached-minus @ 0x7fb1c000-0x7fb1d000 -uncached-minus @ 0x7fb1d000-0x7fb1e000 -uncached-minus @ 0x7fb1e000-0x7fb25000 -uncached-minus @ 0x7fb25000-0x7fb26000 -uncached-minus @ 0x7fb26000-0x7fb27000 -uncached-minus @ 0x7fb27000-0x7fb28000 -uncached-minus @ 0x7fb28000-0x7fb2e000 -uncached-minus @ 0x7fb2e000-0x7fb2f000 -uncached-minus @ 0x7fb2f000-0x7fb30000 -uncached-minus @ 0x7fb31000-0x7fb32000 -uncached-minus @ 0x80000000-0x90000000 - -This list shows physical address ranges and various PAT settings used to -access those physical address ranges. - -Another, more verbose way of getting PAT related debug messages is with -"debugpat" boot parameter. With this parameter, various debug messages are -printed to dmesg log. - -PAT Initialization ------------------- - -The following table describes how PAT is initialized under various -configurations. The PAT MSR must be updated by Linux in order to support WC -and WT attributes. Otherwise, the PAT MSR has the value programmed in it -by the firmware. Note, Xen enables WC attribute in the PAT MSR for guests. - - MTRR PAT Call Sequence PAT State PAT MSR - ========================================================= - E E MTRR -> PAT init Enabled OS - E D MTRR -> PAT init Disabled - - D E MTRR -> PAT disable Disabled BIOS - D D MTRR -> PAT disable Disabled - - - np/E PAT -> PAT disable Disabled BIOS - - np/D PAT -> PAT disable Disabled - - E !P/E MTRR -> PAT init Disabled BIOS - D !P/E MTRR -> PAT disable Disabled BIOS - !M !P/E MTRR stub -> PAT disable Disabled BIOS - - Legend - ------------------------------------------------ - E Feature enabled in CPU - D Feature disabled/unsupported in CPU - np "nopat" boot option specified - !P CONFIG_X86_PAT option unset - !M CONFIG_MTRR option unset - Enabled PAT state set to enabled - Disabled PAT state set to disabled - OS PAT initializes PAT MSR with OS setting - BIOS PAT keeps PAT MSR with BIOS setting - -- cgit v1.2.3 From 28e21eac94a2ee2512ae6c21f04a0b41fb26cb0b Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:26 +0800 Subject: Documentation: x86: convert protection-keys.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/index.rst | 1 + Documentation/x86/protection-keys.rst | 99 +++++++++++++++++++++++++++++++++++ Documentation/x86/protection-keys.txt | 90 ------------------------------- 3 files changed, 100 insertions(+), 90 deletions(-) create mode 100644 Documentation/x86/protection-keys.rst delete mode 100644 Documentation/x86/protection-keys.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index f7012e4afacd..e2c0db9fcd4e 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -18,3 +18,4 @@ x86-specific Documentation tlb mtrr pat + protection-keys diff --git a/Documentation/x86/protection-keys.rst b/Documentation/x86/protection-keys.rst new file mode 100644 index 000000000000..49d9833af871 --- /dev/null +++ b/Documentation/x86/protection-keys.rst @@ -0,0 +1,99 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================== +Memory Protection Keys +====================== + +Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature +which is found on Intel's Skylake "Scalable Processor" Server CPUs. +It will be avalable in future non-server parts. + +For anyone wishing to test or use this feature, it is available in +Amazon's EC2 C5 instances and is known to work there using an Ubuntu +17.04 image. + +Memory Protection Keys provides a mechanism for enforcing page-based +protections, but without requiring modification of the page tables +when an application changes protection domains. It works by +dedicating 4 previously ignored bits in each page table entry to a +"protection key", giving 16 possible keys. + +There is also a new user-accessible register (PKRU) with two separate +bits (Access Disable and Write Disable) for each key. Being a CPU +register, PKRU is inherently thread-local, potentially giving each +thread a different set of protections from every other thread. + +There are two new instructions (RDPKRU/WRPKRU) for reading and writing +to the new register. The feature is only available in 64-bit mode, +even though there is theoretically space in the PAE PTEs. These +permissions are enforced on data access only and have no effect on +instruction fetches. + +Syscalls +======== + +There are 3 system calls which directly interact with pkeys:: + + int pkey_alloc(unsigned long flags, unsigned long init_access_rights) + int pkey_free(int pkey); + int pkey_mprotect(unsigned long start, size_t len, + unsigned long prot, int pkey); + +Before a pkey can be used, it must first be allocated with +pkey_alloc(). An application calls the WRPKRU instruction +directly in order to change access permissions to memory covered +with a key. In this example WRPKRU is wrapped by a C function +called pkey_set(). +:: + + int real_prot = PROT_READ|PROT_WRITE; + pkey = pkey_alloc(0, PKEY_DISABLE_WRITE); + ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); + ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey); + ... application runs here + +Now, if the application needs to update the data at 'ptr', it can +gain access, do the update, then remove its write access:: + + pkey_set(pkey, 0); // clear PKEY_DISABLE_WRITE + *ptr = foo; // assign something + pkey_set(pkey, PKEY_DISABLE_WRITE); // set PKEY_DISABLE_WRITE again + +Now when it frees the memory, it will also free the pkey since it +is no longer in use:: + + munmap(ptr, PAGE_SIZE); + pkey_free(pkey); + +.. note:: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions. + An example implementation can be found in + tools/testing/selftests/x86/protection_keys.c. + +Behavior +======== + +The kernel attempts to make protection keys consistent with the +behavior of a plain mprotect(). For instance if you do this:: + + mprotect(ptr, size, PROT_NONE); + something(ptr); + +you can expect the same effects with protection keys when doing this:: + + pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ); + pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey); + something(ptr); + +That should be true whether something() is a direct access to 'ptr' +like:: + + *ptr = foo; + +or when the kernel does the access on the application's behalf like +with a read():: + + read(fd, ptr, 1); + +The kernel will send a SIGSEGV in both cases, but si_code will be set +to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when +the plain mprotect() permissions are violated. diff --git a/Documentation/x86/protection-keys.txt b/Documentation/x86/protection-keys.txt deleted file mode 100644 index ecb0d2dadfb7..000000000000 --- a/Documentation/x86/protection-keys.txt +++ /dev/null @@ -1,90 +0,0 @@ -Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature -which is found on Intel's Skylake "Scalable Processor" Server CPUs. -It will be avalable in future non-server parts. - -For anyone wishing to test or use this feature, it is available in -Amazon's EC2 C5 instances and is known to work there using an Ubuntu -17.04 image. - -Memory Protection Keys provides a mechanism for enforcing page-based -protections, but without requiring modification of the page tables -when an application changes protection domains. It works by -dedicating 4 previously ignored bits in each page table entry to a -"protection key", giving 16 possible keys. - -There is also a new user-accessible register (PKRU) with two separate -bits (Access Disable and Write Disable) for each key. Being a CPU -register, PKRU is inherently thread-local, potentially giving each -thread a different set of protections from every other thread. - -There are two new instructions (RDPKRU/WRPKRU) for reading and writing -to the new register. The feature is only available in 64-bit mode, -even though there is theoretically space in the PAE PTEs. These -permissions are enforced on data access only and have no effect on -instruction fetches. - -=========================== Syscalls =========================== - -There are 3 system calls which directly interact with pkeys: - - int pkey_alloc(unsigned long flags, unsigned long init_access_rights) - int pkey_free(int pkey); - int pkey_mprotect(unsigned long start, size_t len, - unsigned long prot, int pkey); - -Before a pkey can be used, it must first be allocated with -pkey_alloc(). An application calls the WRPKRU instruction -directly in order to change access permissions to memory covered -with a key. In this example WRPKRU is wrapped by a C function -called pkey_set(). - - int real_prot = PROT_READ|PROT_WRITE; - pkey = pkey_alloc(0, PKEY_DISABLE_WRITE); - ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); - ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey); - ... application runs here - -Now, if the application needs to update the data at 'ptr', it can -gain access, do the update, then remove its write access: - - pkey_set(pkey, 0); // clear PKEY_DISABLE_WRITE - *ptr = foo; // assign something - pkey_set(pkey, PKEY_DISABLE_WRITE); // set PKEY_DISABLE_WRITE again - -Now when it frees the memory, it will also free the pkey since it -is no longer in use: - - munmap(ptr, PAGE_SIZE); - pkey_free(pkey); - -(Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions. - An example implementation can be found in - tools/testing/selftests/x86/protection_keys.c) - -=========================== Behavior =========================== - -The kernel attempts to make protection keys consistent with the -behavior of a plain mprotect(). For instance if you do this: - - mprotect(ptr, size, PROT_NONE); - something(ptr); - -you can expect the same effects with protection keys when doing this: - - pkey = pkey_alloc(0, PKEY_DISABLE_WRITE | PKEY_DISABLE_READ); - pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey); - something(ptr); - -That should be true whether something() is a direct access to 'ptr' -like: - - *ptr = foo; - -or when the kernel does the access on the application's behalf like -with a read(): - - read(fd, ptr, 1); - -The kernel will send a SIGSEGV in both cases, but si_code will be set -to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when -the plain mprotect() permissions are violated. -- cgit v1.2.3 From f10b07a01a48d0584fa9815005e04c54058e2e47 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:27 +0800 Subject: Documentation: x86: convert intel_mpx.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/index.rst | 1 + Documentation/x86/intel_mpx.rst | 252 ++++++++++++++++++++++++++++++++++++++++ Documentation/x86/intel_mpx.txt | 244 -------------------------------------- 3 files changed, 253 insertions(+), 244 deletions(-) create mode 100644 Documentation/x86/intel_mpx.rst delete mode 100644 Documentation/x86/intel_mpx.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index e2c0db9fcd4e..b5cdc0d889b3 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -19,3 +19,4 @@ x86-specific Documentation mtrr pat protection-keys + intel_mpx diff --git a/Documentation/x86/intel_mpx.rst b/Documentation/x86/intel_mpx.rst new file mode 100644 index 000000000000..387a640941a6 --- /dev/null +++ b/Documentation/x86/intel_mpx.rst @@ -0,0 +1,252 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================================== +Intel(R) Memory Protection Extensions (MPX) +=========================================== + +Intel(R) MPX Overview +===================== + +Intel(R) Memory Protection Extensions (Intel(R) MPX) is a new capability +introduced into Intel Architecture. Intel MPX provides hardware features +that can be used in conjunction with compiler changes to check memory +references, for those references whose compile-time normal intentions are +usurped at runtime due to buffer overflow or underflow. + +You can tell if your CPU supports MPX by looking in /proc/cpuinfo:: + + cat /proc/cpuinfo | grep ' mpx ' + +For more information, please refer to Intel(R) Architecture Instruction +Set Extensions Programming Reference, Chapter 9: Intel(R) Memory Protection +Extensions. + +Note: As of December 2014, no hardware with MPX is available but it is +possible to use SDE (Intel(R) Software Development Emulator) instead, which +can be downloaded from +http://software.intel.com/en-us/articles/intel-software-development-emulator + + +How to get the advantage of MPX +=============================== + +For MPX to work, changes are required in the kernel, binutils and compiler. +No source changes are required for applications, just a recompile. + +There are a lot of moving parts of this to all work right. The following +is how we expect the compiler, application and kernel to work together. + +1) Application developer compiles with -fmpx. The compiler will add the + instrumentation as well as some setup code called early after the app + starts. New instruction prefixes are noops for old CPUs. +2) That setup code allocates (virtual) space for the "bounds directory", + points the "bndcfgu" register to the directory (must also set the valid + bit) and notifies the kernel (via the new prctl(PR_MPX_ENABLE_MANAGEMENT)) + that the app will be using MPX. The app must be careful not to access + the bounds tables between the time when it populates "bndcfgu" and + when it calls the prctl(). This might be hard to guarantee if the app + is compiled with MPX. You can add "__attribute__((bnd_legacy))" to + the function to disable MPX instrumentation to help guarantee this. + Also be careful not to call out to any other code which might be + MPX-instrumented. +3) The kernel detects that the CPU has MPX, allows the new prctl() to + succeed, and notes the location of the bounds directory. Userspace is + expected to keep the bounds directory at that location. We note it + instead of reading it each time because the 'xsave' operation needed + to access the bounds directory register is an expensive operation. +4) If the application needs to spill bounds out of the 4 registers, it + issues a bndstx instruction. Since the bounds directory is empty at + this point, a bounds fault (#BR) is raised, the kernel allocates a + bounds table (in the user address space) and makes the relevant entry + in the bounds directory point to the new table. +5) If the application violates the bounds specified in the bounds registers, + a separate kind of #BR is raised which will deliver a signal with + information about the violation in the 'struct siginfo'. +6) Whenever memory is freed, we know that it can no longer contain valid + pointers, and we attempt to free the associated space in the bounds + tables. If an entire table becomes unused, we will attempt to free + the table and remove the entry in the directory. + +To summarize, there are essentially three things interacting here: + +GCC with -fmpx: + * enables annotation of code with MPX instructions and prefixes + * inserts code early in the application to call in to the "gcc runtime" +GCC MPX Runtime: + * Checks for hardware MPX support in cpuid leaf + * allocates virtual space for the bounds directory (malloc() essentially) + * points the hardware BNDCFGU register at the directory + * calls a new prctl(PR_MPX_ENABLE_MANAGEMENT) to notify the kernel to + start managing the bounds directories +Kernel MPX Code: + * Checks for hardware MPX support in cpuid leaf + * Handles #BR exceptions and sends SIGSEGV to the app when it violates + bounds, like during a buffer overflow. + * When bounds are spilled in to an unallocated bounds table, the kernel + notices in the #BR exception, allocates the virtual space, then + updates the bounds directory to point to the new table. It keeps + special track of the memory with a VM_MPX flag. + * Frees unused bounds tables at the time that the memory they described + is unmapped. + + +How does MPX kernel code work +============================= + +Handling #BR faults caused by MPX +--------------------------------- + +When MPX is enabled, there are 2 new situations that can generate +#BR faults. + + * new bounds tables (BT) need to be allocated to save bounds. + * bounds violation caused by MPX instructions. + +We hook #BR handler to handle these two new situations. + +On-demand kernel allocation of bounds tables +-------------------------------------------- + +MPX only has 4 hardware registers for storing bounds information. If +MPX-enabled code needs more than these 4 registers, it needs to spill +them somewhere. It has two special instructions for this which allow +the bounds to be moved between the bounds registers and some new "bounds +tables". + +#BR exceptions are a new class of exceptions just for MPX. They are +similar conceptually to a page fault and will be raised by the MPX +hardware during both bounds violations or when the tables are not +present. The kernel handles those #BR exceptions for not-present tables +by carving the space out of the normal processes address space and then +pointing the bounds-directory over to it. + +The tables need to be accessed and controlled by userspace because +the instructions for moving bounds in and out of them are extremely +frequent. They potentially happen every time a register points to +memory. Any direct kernel involvement (like a syscall) to access the +tables would obviously destroy performance. + +Why not do this in userspace? MPX does not strictly require anything in +the kernel. It can theoretically be done completely from userspace. Here +are a few ways this could be done. We don't think any of them are practical +in the real-world, but here they are. + +:Q: Can virtual space simply be reserved for the bounds tables so that we + never have to allocate them? +:A: MPX-enabled application will possibly create a lot of bounds tables in + process address space to save bounds information. These tables can take + up huge swaths of memory (as much as 80% of the memory on the system) + even if we clean them up aggressively. In the worst-case scenario, the + tables can be 4x the size of the data structure being tracked. IOW, a + 1-page structure can require 4 bounds-table pages. An X-GB virtual + area needs 4*X GB of virtual space, plus 2GB for the bounds directory. + If we were to preallocate them for the 128TB of user virtual address + space, we would need to reserve 512TB+2GB, which is larger than the + entire virtual address space today. This means they can not be reserved + ahead of time. Also, a single process's pre-populated bounds directory + consumes 2GB of virtual *AND* physical memory. IOW, it's completely + infeasible to prepopulate bounds directories. + +:Q: Can we preallocate bounds table space at the same time memory is + allocated which might contain pointers that might eventually need + bounds tables? +:A: This would work if we could hook the site of each and every memory + allocation syscall. This can be done for small, constrained applications. + But, it isn't practical at a larger scale since a given app has no + way of controlling how all the parts of the app might allocate memory + (think libraries). The kernel is really the only place to intercept + these calls. + +:Q: Could a bounds fault be handed to userspace and the tables allocated + there in a signal handler instead of in the kernel? +:A: mmap() is not on the list of safe async handler functions and even + if mmap() would work it still requires locking or nasty tricks to + keep track of the allocation state there. + +Having ruled out all of the userspace-only approaches for managing +bounds tables that we could think of, we create them on demand in +the kernel. + +Decoding MPX instructions +------------------------- + +If a #BR is generated due to a bounds violation caused by MPX. +We need to decode MPX instructions to get violation address and +set this address into extended struct siginfo. + +The _sigfault field of struct siginfo is extended as follow:: + + 87 /* SIGILL, SIGFPE, SIGSEGV, SIGBUS */ + 88 struct { + 89 void __user *_addr; /* faulting insn/memory ref. */ + 90 #ifdef __ARCH_SI_TRAPNO + 91 int _trapno; /* TRAP # which caused the signal */ + 92 #endif + 93 short _addr_lsb; /* LSB of the reported address */ + 94 struct { + 95 void __user *_lower; + 96 void __user *_upper; + 97 } _addr_bnd; + 98 } _sigfault; + +The '_addr' field refers to violation address, and new '_addr_and' +field refers to the upper/lower bounds when a #BR is caused. + +Glibc will be also updated to support this new siginfo. So user +can get violation address and bounds when bounds violations occur. + +Cleanup unused bounds tables +---------------------------- + +When a BNDSTX instruction attempts to save bounds to a bounds directory +entry marked as invalid, a #BR is generated. This is an indication that +no bounds table exists for this entry. In this case the fault handler +will allocate a new bounds table on demand. + +Since the kernel allocated those tables on-demand without userspace +knowledge, it is also responsible for freeing them when the associated +mappings go away. + +Here, the solution for this issue is to hook do_munmap() to check +whether one process is MPX enabled. If yes, those bounds tables covered +in the virtual address region which is being unmapped will be freed also. + +Adding new prctl commands +------------------------- + +Two new prctl commands are added to enable and disable MPX bounds tables +management in kernel. +:: + + 155 #define PR_MPX_ENABLE_MANAGEMENT 43 + 156 #define PR_MPX_DISABLE_MANAGEMENT 44 + +Runtime library in userspace is responsible for allocation of bounds +directory. So kernel have to use XSAVE instruction to get the base +of bounds directory from BNDCFG register. + +But XSAVE is expected to be very expensive. In order to do performance +optimization, we have to get the base of bounds directory and save it +into struct mm_struct to be used in future during PR_MPX_ENABLE_MANAGEMENT +command execution. + + +Special rules +============= + +1) If userspace is requesting help from the kernel to do the management +of bounds tables, it may not create or modify entries in the bounds directory. + +Certainly users can allocate bounds tables and forcibly point the bounds +directory at them through XSAVE instruction, and then set valid bit +of bounds entry to have this entry valid. But, the kernel will decline +to assist in managing these tables. + +2) Userspace may not take multiple bounds directory entries and point +them at the same bounds table. + +This is allowed architecturally. See more information "Intel(R) Architecture +Instruction Set Extensions Programming Reference" (9.3.4). + +However, if users did this, the kernel might be fooled in to unmapping an +in-use bounds table since it does not recognize sharing. diff --git a/Documentation/x86/intel_mpx.txt b/Documentation/x86/intel_mpx.txt deleted file mode 100644 index 85d0549ad846..000000000000 --- a/Documentation/x86/intel_mpx.txt +++ /dev/null @@ -1,244 +0,0 @@ -1. Intel(R) MPX Overview -======================== - -Intel(R) Memory Protection Extensions (Intel(R) MPX) is a new capability -introduced into Intel Architecture. Intel MPX provides hardware features -that can be used in conjunction with compiler changes to check memory -references, for those references whose compile-time normal intentions are -usurped at runtime due to buffer overflow or underflow. - -You can tell if your CPU supports MPX by looking in /proc/cpuinfo: - - cat /proc/cpuinfo | grep ' mpx ' - -For more information, please refer to Intel(R) Architecture Instruction -Set Extensions Programming Reference, Chapter 9: Intel(R) Memory Protection -Extensions. - -Note: As of December 2014, no hardware with MPX is available but it is -possible to use SDE (Intel(R) Software Development Emulator) instead, which -can be downloaded from -http://software.intel.com/en-us/articles/intel-software-development-emulator - - -2. How to get the advantage of MPX -================================== - -For MPX to work, changes are required in the kernel, binutils and compiler. -No source changes are required for applications, just a recompile. - -There are a lot of moving parts of this to all work right. The following -is how we expect the compiler, application and kernel to work together. - -1) Application developer compiles with -fmpx. The compiler will add the - instrumentation as well as some setup code called early after the app - starts. New instruction prefixes are noops for old CPUs. -2) That setup code allocates (virtual) space for the "bounds directory", - points the "bndcfgu" register to the directory (must also set the valid - bit) and notifies the kernel (via the new prctl(PR_MPX_ENABLE_MANAGEMENT)) - that the app will be using MPX. The app must be careful not to access - the bounds tables between the time when it populates "bndcfgu" and - when it calls the prctl(). This might be hard to guarantee if the app - is compiled with MPX. You can add "__attribute__((bnd_legacy))" to - the function to disable MPX instrumentation to help guarantee this. - Also be careful not to call out to any other code which might be - MPX-instrumented. -3) The kernel detects that the CPU has MPX, allows the new prctl() to - succeed, and notes the location of the bounds directory. Userspace is - expected to keep the bounds directory at that location. We note it - instead of reading it each time because the 'xsave' operation needed - to access the bounds directory register is an expensive operation. -4) If the application needs to spill bounds out of the 4 registers, it - issues a bndstx instruction. Since the bounds directory is empty at - this point, a bounds fault (#BR) is raised, the kernel allocates a - bounds table (in the user address space) and makes the relevant entry - in the bounds directory point to the new table. -5) If the application violates the bounds specified in the bounds registers, - a separate kind of #BR is raised which will deliver a signal with - information about the violation in the 'struct siginfo'. -6) Whenever memory is freed, we know that it can no longer contain valid - pointers, and we attempt to free the associated space in the bounds - tables. If an entire table becomes unused, we will attempt to free - the table and remove the entry in the directory. - -To summarize, there are essentially three things interacting here: - -GCC with -fmpx: - * enables annotation of code with MPX instructions and prefixes - * inserts code early in the application to call in to the "gcc runtime" -GCC MPX Runtime: - * Checks for hardware MPX support in cpuid leaf - * allocates virtual space for the bounds directory (malloc() essentially) - * points the hardware BNDCFGU register at the directory - * calls a new prctl(PR_MPX_ENABLE_MANAGEMENT) to notify the kernel to - start managing the bounds directories -Kernel MPX Code: - * Checks for hardware MPX support in cpuid leaf - * Handles #BR exceptions and sends SIGSEGV to the app when it violates - bounds, like during a buffer overflow. - * When bounds are spilled in to an unallocated bounds table, the kernel - notices in the #BR exception, allocates the virtual space, then - updates the bounds directory to point to the new table. It keeps - special track of the memory with a VM_MPX flag. - * Frees unused bounds tables at the time that the memory they described - is unmapped. - - -3. How does MPX kernel code work -================================ - -Handling #BR faults caused by MPX ---------------------------------- - -When MPX is enabled, there are 2 new situations that can generate -#BR faults. - * new bounds tables (BT) need to be allocated to save bounds. - * bounds violation caused by MPX instructions. - -We hook #BR handler to handle these two new situations. - -On-demand kernel allocation of bounds tables --------------------------------------------- - -MPX only has 4 hardware registers for storing bounds information. If -MPX-enabled code needs more than these 4 registers, it needs to spill -them somewhere. It has two special instructions for this which allow -the bounds to be moved between the bounds registers and some new "bounds -tables". - -#BR exceptions are a new class of exceptions just for MPX. They are -similar conceptually to a page fault and will be raised by the MPX -hardware during both bounds violations or when the tables are not -present. The kernel handles those #BR exceptions for not-present tables -by carving the space out of the normal processes address space and then -pointing the bounds-directory over to it. - -The tables need to be accessed and controlled by userspace because -the instructions for moving bounds in and out of them are extremely -frequent. They potentially happen every time a register points to -memory. Any direct kernel involvement (like a syscall) to access the -tables would obviously destroy performance. - -Why not do this in userspace? MPX does not strictly require anything in -the kernel. It can theoretically be done completely from userspace. Here -are a few ways this could be done. We don't think any of them are practical -in the real-world, but here they are. - -Q: Can virtual space simply be reserved for the bounds tables so that we - never have to allocate them? -A: MPX-enabled application will possibly create a lot of bounds tables in - process address space to save bounds information. These tables can take - up huge swaths of memory (as much as 80% of the memory on the system) - even if we clean them up aggressively. In the worst-case scenario, the - tables can be 4x the size of the data structure being tracked. IOW, a - 1-page structure can require 4 bounds-table pages. An X-GB virtual - area needs 4*X GB of virtual space, plus 2GB for the bounds directory. - If we were to preallocate them for the 128TB of user virtual address - space, we would need to reserve 512TB+2GB, which is larger than the - entire virtual address space today. This means they can not be reserved - ahead of time. Also, a single process's pre-populated bounds directory - consumes 2GB of virtual *AND* physical memory. IOW, it's completely - infeasible to prepopulate bounds directories. - -Q: Can we preallocate bounds table space at the same time memory is - allocated which might contain pointers that might eventually need - bounds tables? -A: This would work if we could hook the site of each and every memory - allocation syscall. This can be done for small, constrained applications. - But, it isn't practical at a larger scale since a given app has no - way of controlling how all the parts of the app might allocate memory - (think libraries). The kernel is really the only place to intercept - these calls. - -Q: Could a bounds fault be handed to userspace and the tables allocated - there in a signal handler instead of in the kernel? -A: mmap() is not on the list of safe async handler functions and even - if mmap() would work it still requires locking or nasty tricks to - keep track of the allocation state there. - -Having ruled out all of the userspace-only approaches for managing -bounds tables that we could think of, we create them on demand in -the kernel. - -Decoding MPX instructions -------------------------- - -If a #BR is generated due to a bounds violation caused by MPX. -We need to decode MPX instructions to get violation address and -set this address into extended struct siginfo. - -The _sigfault field of struct siginfo is extended as follow: - -87 /* SIGILL, SIGFPE, SIGSEGV, SIGBUS */ -88 struct { -89 void __user *_addr; /* faulting insn/memory ref. */ -90 #ifdef __ARCH_SI_TRAPNO -91 int _trapno; /* TRAP # which caused the signal */ -92 #endif -93 short _addr_lsb; /* LSB of the reported address */ -94 struct { -95 void __user *_lower; -96 void __user *_upper; -97 } _addr_bnd; -98 } _sigfault; - -The '_addr' field refers to violation address, and new '_addr_and' -field refers to the upper/lower bounds when a #BR is caused. - -Glibc will be also updated to support this new siginfo. So user -can get violation address and bounds when bounds violations occur. - -Cleanup unused bounds tables ----------------------------- - -When a BNDSTX instruction attempts to save bounds to a bounds directory -entry marked as invalid, a #BR is generated. This is an indication that -no bounds table exists for this entry. In this case the fault handler -will allocate a new bounds table on demand. - -Since the kernel allocated those tables on-demand without userspace -knowledge, it is also responsible for freeing them when the associated -mappings go away. - -Here, the solution for this issue is to hook do_munmap() to check -whether one process is MPX enabled. If yes, those bounds tables covered -in the virtual address region which is being unmapped will be freed also. - -Adding new prctl commands -------------------------- - -Two new prctl commands are added to enable and disable MPX bounds tables -management in kernel. - -155 #define PR_MPX_ENABLE_MANAGEMENT 43 -156 #define PR_MPX_DISABLE_MANAGEMENT 44 - -Runtime library in userspace is responsible for allocation of bounds -directory. So kernel have to use XSAVE instruction to get the base -of bounds directory from BNDCFG register. - -But XSAVE is expected to be very expensive. In order to do performance -optimization, we have to get the base of bounds directory and save it -into struct mm_struct to be used in future during PR_MPX_ENABLE_MANAGEMENT -command execution. - - -4. Special rules -================ - -1) If userspace is requesting help from the kernel to do the management -of bounds tables, it may not create or modify entries in the bounds directory. - -Certainly users can allocate bounds tables and forcibly point the bounds -directory at them through XSAVE instruction, and then set valid bit -of bounds entry to have this entry valid. But, the kernel will decline -to assist in managing these tables. - -2) Userspace may not take multiple bounds directory entries and point -them at the same bounds table. - -This is allowed architecturally. See more information "Intel(R) Architecture -Instruction Set Extensions Programming Reference" (9.3.4). - -However, if users did this, the kernel might be fooled in to unmapping an -in-use bounds table since it does not recognize sharing. -- cgit v1.2.3 From 0c7180f2e4e6fc02f268e18c4f753d9f9cdfb5ad Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:28 +0800 Subject: Documentation: x86: convert amd-memory-encryption.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/amd-memory-encryption.rst | 97 +++++++++++++++++++++++++++++ Documentation/x86/amd-memory-encryption.txt | 90 -------------------------- Documentation/x86/index.rst | 1 + 3 files changed, 98 insertions(+), 90 deletions(-) create mode 100644 Documentation/x86/amd-memory-encryption.rst delete mode 100644 Documentation/x86/amd-memory-encryption.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/amd-memory-encryption.rst b/Documentation/x86/amd-memory-encryption.rst new file mode 100644 index 000000000000..c48d452d0718 --- /dev/null +++ b/Documentation/x86/amd-memory-encryption.rst @@ -0,0 +1,97 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +AMD Memory Encryption +===================== + +Secure Memory Encryption (SME) and Secure Encrypted Virtualization (SEV) are +features found on AMD processors. + +SME provides the ability to mark individual pages of memory as encrypted using +the standard x86 page tables. A page that is marked encrypted will be +automatically decrypted when read from DRAM and encrypted when written to +DRAM. SME can therefore be used to protect the contents of DRAM from physical +attacks on the system. + +SEV enables running encrypted virtual machines (VMs) in which the code and data +of the guest VM are secured so that a decrypted version is available only +within the VM itself. SEV guest VMs have the concept of private and shared +memory. Private memory is encrypted with the guest-specific key, while shared +memory may be encrypted with hypervisor key. When SME is enabled, the hypervisor +key is the same key which is used in SME. + +A page is encrypted when a page table entry has the encryption bit set (see +below on how to determine its position). The encryption bit can also be +specified in the cr3 register, allowing the PGD table to be encrypted. Each +successive level of page tables can also be encrypted by setting the encryption +bit in the page table entry that points to the next table. This allows the full +page table hierarchy to be encrypted. Note, this means that just because the +encryption bit is set in cr3, doesn't imply the full hierarchy is encrypted. +Each page table entry in the hierarchy needs to have the encryption bit set to +achieve that. So, theoretically, you could have the encryption bit set in cr3 +so that the PGD is encrypted, but not set the encryption bit in the PGD entry +for a PUD which results in the PUD pointed to by that entry to not be +encrypted. + +When SEV is enabled, instruction pages and guest page tables are always treated +as private. All the DMA operations inside the guest must be performed on shared +memory. Since the memory encryption bit is controlled by the guest OS when it +is operating in 64-bit or 32-bit PAE mode, in all other modes the SEV hardware +forces the memory encryption bit to 1. + +Support for SME and SEV can be determined through the CPUID instruction. The +CPUID function 0x8000001f reports information related to SME:: + + 0x8000001f[eax]: + Bit[0] indicates support for SME + Bit[1] indicates support for SEV + 0x8000001f[ebx]: + Bits[5:0] pagetable bit number used to activate memory + encryption + Bits[11:6] reduction in physical address space, in bits, when + memory encryption is enabled (this only affects + system physical addresses, not guest physical + addresses) + +If support for SME is present, MSR 0xc00100010 (MSR_K8_SYSCFG) can be used to +determine if SME is enabled and/or to enable memory encryption:: + + 0xc0010010: + Bit[23] 0 = memory encryption features are disabled + 1 = memory encryption features are enabled + +If SEV is supported, MSR 0xc0010131 (MSR_AMD64_SEV) can be used to determine if +SEV is active:: + + 0xc0010131: + Bit[0] 0 = memory encryption is not active + 1 = memory encryption is active + +Linux relies on BIOS to set this bit if BIOS has determined that the reduction +in the physical address space as a result of enabling memory encryption (see +CPUID information above) will not conflict with the address space resource +requirements for the system. If this bit is not set upon Linux startup then +Linux itself will not set it and memory encryption will not be possible. + +The state of SME in the Linux kernel can be documented as follows: + + - Supported: + The CPU supports SME (determined through CPUID instruction). + + - Enabled: + Supported and bit 23 of MSR_K8_SYSCFG is set. + + - Active: + Supported, Enabled and the Linux kernel is actively applying + the encryption bit to page table entries (the SME mask in the + kernel is non-zero). + +SME can also be enabled and activated in the BIOS. If SME is enabled and +activated in the BIOS, then all memory accesses will be encrypted and it will +not be necessary to activate the Linux memory encryption support. If the BIOS +merely enables SME (sets bit 23 of the MSR_K8_SYSCFG), then Linux can activate +memory encryption by default (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y) or +by supplying mem_encrypt=on on the kernel command line. However, if BIOS does +not enable SME, then Linux will not be able to activate memory encryption, even +if configured to do so by default or the mem_encrypt=on command line parameter +is specified. diff --git a/Documentation/x86/amd-memory-encryption.txt b/Documentation/x86/amd-memory-encryption.txt deleted file mode 100644 index afc41f544dab..000000000000 --- a/Documentation/x86/amd-memory-encryption.txt +++ /dev/null @@ -1,90 +0,0 @@ -Secure Memory Encryption (SME) and Secure Encrypted Virtualization (SEV) are -features found on AMD processors. - -SME provides the ability to mark individual pages of memory as encrypted using -the standard x86 page tables. A page that is marked encrypted will be -automatically decrypted when read from DRAM and encrypted when written to -DRAM. SME can therefore be used to protect the contents of DRAM from physical -attacks on the system. - -SEV enables running encrypted virtual machines (VMs) in which the code and data -of the guest VM are secured so that a decrypted version is available only -within the VM itself. SEV guest VMs have the concept of private and shared -memory. Private memory is encrypted with the guest-specific key, while shared -memory may be encrypted with hypervisor key. When SME is enabled, the hypervisor -key is the same key which is used in SME. - -A page is encrypted when a page table entry has the encryption bit set (see -below on how to determine its position). The encryption bit can also be -specified in the cr3 register, allowing the PGD table to be encrypted. Each -successive level of page tables can also be encrypted by setting the encryption -bit in the page table entry that points to the next table. This allows the full -page table hierarchy to be encrypted. Note, this means that just because the -encryption bit is set in cr3, doesn't imply the full hierarchy is encrypted. -Each page table entry in the hierarchy needs to have the encryption bit set to -achieve that. So, theoretically, you could have the encryption bit set in cr3 -so that the PGD is encrypted, but not set the encryption bit in the PGD entry -for a PUD which results in the PUD pointed to by that entry to not be -encrypted. - -When SEV is enabled, instruction pages and guest page tables are always treated -as private. All the DMA operations inside the guest must be performed on shared -memory. Since the memory encryption bit is controlled by the guest OS when it -is operating in 64-bit or 32-bit PAE mode, in all other modes the SEV hardware -forces the memory encryption bit to 1. - -Support for SME and SEV can be determined through the CPUID instruction. The -CPUID function 0x8000001f reports information related to SME: - - 0x8000001f[eax]: - Bit[0] indicates support for SME - Bit[1] indicates support for SEV - 0x8000001f[ebx]: - Bits[5:0] pagetable bit number used to activate memory - encryption - Bits[11:6] reduction in physical address space, in bits, when - memory encryption is enabled (this only affects - system physical addresses, not guest physical - addresses) - -If support for SME is present, MSR 0xc00100010 (MSR_K8_SYSCFG) can be used to -determine if SME is enabled and/or to enable memory encryption: - - 0xc0010010: - Bit[23] 0 = memory encryption features are disabled - 1 = memory encryption features are enabled - -If SEV is supported, MSR 0xc0010131 (MSR_AMD64_SEV) can be used to determine if -SEV is active: - - 0xc0010131: - Bit[0] 0 = memory encryption is not active - 1 = memory encryption is active - -Linux relies on BIOS to set this bit if BIOS has determined that the reduction -in the physical address space as a result of enabling memory encryption (see -CPUID information above) will not conflict with the address space resource -requirements for the system. If this bit is not set upon Linux startup then -Linux itself will not set it and memory encryption will not be possible. - -The state of SME in the Linux kernel can be documented as follows: - - Supported: - The CPU supports SME (determined through CPUID instruction). - - - Enabled: - Supported and bit 23 of MSR_K8_SYSCFG is set. - - - Active: - Supported, Enabled and the Linux kernel is actively applying - the encryption bit to page table entries (the SME mask in the - kernel is non-zero). - -SME can also be enabled and activated in the BIOS. If SME is enabled and -activated in the BIOS, then all memory accesses will be encrypted and it will -not be necessary to activate the Linux memory encryption support. If the BIOS -merely enables SME (sets bit 23 of the MSR_K8_SYSCFG), then Linux can activate -memory encryption by default (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y) or -by supplying mem_encrypt=on on the kernel command line. However, if BIOS does -not enable SME, then Linux will not be able to activate memory encryption, even -if configured to do so by default or the mem_encrypt=on command line parameter -is specified. diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index b5cdc0d889b3..85f1f44cc8ac 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -20,3 +20,4 @@ x86-specific Documentation pat protection-keys intel_mpx + amd-memory-encryption -- cgit v1.2.3 From ea0765e835e084707abb7e14d84b572f5ffe4242 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:29 +0800 Subject: Documentation: x86: convert pti.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/index.rst | 1 + Documentation/x86/pti.rst | 195 ++++++++++++++++++++++++++++++++++++++++++++ Documentation/x86/pti.txt | 186 ------------------------------------------ 3 files changed, 196 insertions(+), 186 deletions(-) create mode 100644 Documentation/x86/pti.rst delete mode 100644 Documentation/x86/pti.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index 85f1f44cc8ac..6719defc16f8 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -21,3 +21,4 @@ x86-specific Documentation protection-keys intel_mpx amd-memory-encryption + pti diff --git a/Documentation/x86/pti.rst b/Documentation/x86/pti.rst new file mode 100644 index 000000000000..4b858a9bad8d --- /dev/null +++ b/Documentation/x86/pti.rst @@ -0,0 +1,195 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== +Page Table Isolation (PTI) +========================== + +Overview +======== + +Page Table Isolation (pti, previously known as KAISER [1]_) is a +countermeasure against attacks on the shared user/kernel address +space such as the "Meltdown" approach [2]_. + +To mitigate this class of attacks, we create an independent set of +page tables for use only when running userspace applications. When +the kernel is entered via syscalls, interrupts or exceptions, the +page tables are switched to the full "kernel" copy. When the system +switches back to user mode, the user copy is used again. + +The userspace page tables contain only a minimal amount of kernel +data: only what is needed to enter/exit the kernel such as the +entry/exit functions themselves and the interrupt descriptor table +(IDT). There are a few strictly unnecessary things that get mapped +such as the first C function when entering an interrupt (see +comments in pti.c). + +This approach helps to ensure that side-channel attacks leveraging +the paging structures do not function when PTI is enabled. It can be +enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time. +Once enabled at compile-time, it can be disabled at boot with the +'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt). + +Page Table Management +===================== + +When PTI is enabled, the kernel manages two sets of page tables. +The first set is very similar to the single set which is present in +kernels without PTI. This includes a complete mapping of userspace +that the kernel can use for things like copy_to_user(). + +Although _complete_, the user portion of the kernel page tables is +crippled by setting the NX bit in the top level. This ensures +that any missed kernel->user CR3 switch will immediately crash +userspace upon executing its first instruction. + +The userspace page tables map only the kernel data needed to enter +and exit the kernel. This data is entirely contained in the 'struct +cpu_entry_area' structure which is placed in the fixmap which gives +each CPU's copy of the area a compile-time-fixed virtual address. + +For new userspace mappings, the kernel makes the entries in its +page tables like normal. The only difference is when the kernel +makes entries in the top (PGD) level. In addition to setting the +entry in the main kernel PGD, a copy of the entry is made in the +userspace page tables' PGD. + +This sharing at the PGD level also inherently shares all the lower +layers of the page tables. This leaves a single, shared set of +userspace page tables to manage. One PTE to lock, one set of +accessed bits, dirty bits, etc... + +Overhead +======== + +Protection against side-channel attacks is important. But, +this protection comes at a cost: + +1. Increased Memory Use + + a. Each process now needs an order-1 PGD instead of order-0. + (Consumes an additional 4k per process). + b. The 'cpu_entry_area' structure must be 2MB in size and 2MB + aligned so that it can be mapped by setting a single PMD + entry. This consumes nearly 2MB of RAM once the kernel + is decompressed, but no space in the kernel image itself. + +2. Runtime Cost + + a. CR3 manipulation to switch between the page table copies + must be done at interrupt, syscall, and exception entry + and exit (it can be skipped when the kernel is interrupted, + though.) Moves to CR3 are on the order of a hundred + cycles, and are required at every entry and exit. + b. A "trampoline" must be used for SYSCALL entry. This + trampoline depends on a smaller set of resources than the + non-PTI SYSCALL entry code, so requires mapping fewer + things into the userspace page tables. The downside is + that stacks must be switched at entry time. + c. Global pages are disabled for all kernel structures not + mapped into both kernel and userspace page tables. This + feature of the MMU allows different processes to share TLB + entries mapping the kernel. Losing the feature means more + TLB misses after a context switch. The actual loss of + performance is very small, however, never exceeding 1%. + d. Process Context IDentifiers (PCID) is a CPU feature that + allows us to skip flushing the entire TLB when switching page + tables by setting a special bit in CR3 when the page tables + are changed. This makes switching the page tables (at context + switch, or kernel entry/exit) cheaper. But, on systems with + PCID support, the context switch code must flush both the user + and kernel entries out of the TLB. The user PCID TLB flush is + deferred until the exit to userspace, minimizing the cost. + See intel.com/sdm for the gory PCID/INVPCID details. + e. The userspace page tables must be populated for each new + process. Even without PTI, the shared kernel mappings + are created by copying top-level (PGD) entries into each + new process. But, with PTI, there are now *two* kernel + mappings: one in the kernel page tables that maps everything + and one for the entry/exit structures. At fork(), we need to + copy both. + f. In addition to the fork()-time copying, there must also + be an update to the userspace PGD any time a set_pgd() is done + on a PGD used to map userspace. This ensures that the kernel + and userspace copies always map the same userspace + memory. + g. On systems without PCID support, each CR3 write flushes + the entire TLB. That means that each syscall, interrupt + or exception flushes the TLB. + h. INVPCID is a TLB-flushing instruction which allows flushing + of TLB entries for non-current PCIDs. Some systems support + PCIDs, but do not support INVPCID. On these systems, addresses + can only be flushed from the TLB for the current PCID. When + flushing a kernel address, we need to flush all PCIDs, so a + single kernel address flush will require a TLB-flushing CR3 + write upon the next use of every PCID. + +Possible Future Work +==================== +1. We can be more careful about not actually writing to CR3 + unless its value is actually changed. +2. Allow PTI to be enabled/disabled at runtime in addition to the + boot-time switching. + +Testing +======== + +To test stability of PTI, the following test procedure is recommended, +ideally doing all of these in parallel: + +1. Set CONFIG_DEBUG_ENTRY=y +2. Run several copies of all of the tools/testing/selftests/x86/ tests + (excluding MPX and protection_keys) in a loop on multiple CPUs for + several minutes. These tests frequently uncover corner cases in the + kernel entry code. In general, old kernels might cause these tests + themselves to crash, but they should never crash the kernel. +3. Run the 'perf' tool in a mode (top or record) that generates many + frequent performance monitoring non-maskable interrupts (see "NMI" + in /proc/interrupts). This exercises the NMI entry/exit code which + is known to trigger bugs in code paths that did not expect to be + interrupted, including nested NMIs. Using "-c" boosts the rate of + NMIs, and using two -c with separate counters encourages nested NMIs + and less deterministic behavior. + :: + + while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done + +4. Launch a KVM virtual machine. +5. Run 32-bit binaries on systems supporting the SYSCALL instruction. + This has been a lightly-tested code path and needs extra scrutiny. + +Debugging +========= + +Bugs in PTI cause a few different signatures of crashes +that are worth noting here. + + * Failures of the selftests/x86 code. Usually a bug in one of the + more obscure corners of entry_64.S + * Crashes in early boot, especially around CPU bringup. Bugs + in the trampoline code or mappings cause these. + * Crashes at the first interrupt. Caused by bugs in entry_64.S, + like screwing up a page table switch. Also caused by + incorrectly mapping the IRQ handler entry code. + * Crashes at the first NMI. The NMI code is separate from main + interrupt handlers and can have bugs that do not affect + normal interrupts. Also caused by incorrectly mapping NMI + code. NMIs that interrupt the entry code must be very + careful and can be the cause of crashes that show up when + running perf. + * Kernel crashes at the first exit to userspace. entry_64.S + bugs, or failing to map some of the exit code. + * Crashes at first interrupt that interrupts userspace. The paths + in entry_64.S that return to userspace are sometimes separate + from the ones that return to the kernel. + * Double faults: overflowing the kernel stack because of page + faults upon page faults. Caused by touching non-pti-mapped + data in the entry code, or forgetting to switch to kernel + CR3 before calling into C functions which are not pti-mapped. + * Userspace segfaults early in boot, sometimes manifesting + as mount(8) failing to mount the rootfs. These have + tended to be TLB invalidation issues. Usually invalidating + the wrong PCID, or otherwise missing an invalidation. + +.. [1] https://gruss.cc/files/kaiser.pdf +.. [2] https://meltdownattack.com/meltdown.pdf diff --git a/Documentation/x86/pti.txt b/Documentation/x86/pti.txt deleted file mode 100644 index 5cd58439ad2d..000000000000 --- a/Documentation/x86/pti.txt +++ /dev/null @@ -1,186 +0,0 @@ -Overview -======== - -Page Table Isolation (pti, previously known as KAISER[1]) is a -countermeasure against attacks on the shared user/kernel address -space such as the "Meltdown" approach[2]. - -To mitigate this class of attacks, we create an independent set of -page tables for use only when running userspace applications. When -the kernel is entered via syscalls, interrupts or exceptions, the -page tables are switched to the full "kernel" copy. When the system -switches back to user mode, the user copy is used again. - -The userspace page tables contain only a minimal amount of kernel -data: only what is needed to enter/exit the kernel such as the -entry/exit functions themselves and the interrupt descriptor table -(IDT). There are a few strictly unnecessary things that get mapped -such as the first C function when entering an interrupt (see -comments in pti.c). - -This approach helps to ensure that side-channel attacks leveraging -the paging structures do not function when PTI is enabled. It can be -enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time. -Once enabled at compile-time, it can be disabled at boot with the -'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt). - -Page Table Management -===================== - -When PTI is enabled, the kernel manages two sets of page tables. -The first set is very similar to the single set which is present in -kernels without PTI. This includes a complete mapping of userspace -that the kernel can use for things like copy_to_user(). - -Although _complete_, the user portion of the kernel page tables is -crippled by setting the NX bit in the top level. This ensures -that any missed kernel->user CR3 switch will immediately crash -userspace upon executing its first instruction. - -The userspace page tables map only the kernel data needed to enter -and exit the kernel. This data is entirely contained in the 'struct -cpu_entry_area' structure which is placed in the fixmap which gives -each CPU's copy of the area a compile-time-fixed virtual address. - -For new userspace mappings, the kernel makes the entries in its -page tables like normal. The only difference is when the kernel -makes entries in the top (PGD) level. In addition to setting the -entry in the main kernel PGD, a copy of the entry is made in the -userspace page tables' PGD. - -This sharing at the PGD level also inherently shares all the lower -layers of the page tables. This leaves a single, shared set of -userspace page tables to manage. One PTE to lock, one set of -accessed bits, dirty bits, etc... - -Overhead -======== - -Protection against side-channel attacks is important. But, -this protection comes at a cost: - -1. Increased Memory Use - a. Each process now needs an order-1 PGD instead of order-0. - (Consumes an additional 4k per process). - b. The 'cpu_entry_area' structure must be 2MB in size and 2MB - aligned so that it can be mapped by setting a single PMD - entry. This consumes nearly 2MB of RAM once the kernel - is decompressed, but no space in the kernel image itself. - -2. Runtime Cost - a. CR3 manipulation to switch between the page table copies - must be done at interrupt, syscall, and exception entry - and exit (it can be skipped when the kernel is interrupted, - though.) Moves to CR3 are on the order of a hundred - cycles, and are required at every entry and exit. - b. A "trampoline" must be used for SYSCALL entry. This - trampoline depends on a smaller set of resources than the - non-PTI SYSCALL entry code, so requires mapping fewer - things into the userspace page tables. The downside is - that stacks must be switched at entry time. - c. Global pages are disabled for all kernel structures not - mapped into both kernel and userspace page tables. This - feature of the MMU allows different processes to share TLB - entries mapping the kernel. Losing the feature means more - TLB misses after a context switch. The actual loss of - performance is very small, however, never exceeding 1%. - d. Process Context IDentifiers (PCID) is a CPU feature that - allows us to skip flushing the entire TLB when switching page - tables by setting a special bit in CR3 when the page tables - are changed. This makes switching the page tables (at context - switch, or kernel entry/exit) cheaper. But, on systems with - PCID support, the context switch code must flush both the user - and kernel entries out of the TLB. The user PCID TLB flush is - deferred until the exit to userspace, minimizing the cost. - See intel.com/sdm for the gory PCID/INVPCID details. - e. The userspace page tables must be populated for each new - process. Even without PTI, the shared kernel mappings - are created by copying top-level (PGD) entries into each - new process. But, with PTI, there are now *two* kernel - mappings: one in the kernel page tables that maps everything - and one for the entry/exit structures. At fork(), we need to - copy both. - f. In addition to the fork()-time copying, there must also - be an update to the userspace PGD any time a set_pgd() is done - on a PGD used to map userspace. This ensures that the kernel - and userspace copies always map the same userspace - memory. - g. On systems without PCID support, each CR3 write flushes - the entire TLB. That means that each syscall, interrupt - or exception flushes the TLB. - h. INVPCID is a TLB-flushing instruction which allows flushing - of TLB entries for non-current PCIDs. Some systems support - PCIDs, but do not support INVPCID. On these systems, addresses - can only be flushed from the TLB for the current PCID. When - flushing a kernel address, we need to flush all PCIDs, so a - single kernel address flush will require a TLB-flushing CR3 - write upon the next use of every PCID. - -Possible Future Work -==================== -1. We can be more careful about not actually writing to CR3 - unless its value is actually changed. -2. Allow PTI to be enabled/disabled at runtime in addition to the - boot-time switching. - -Testing -======== - -To test stability of PTI, the following test procedure is recommended, -ideally doing all of these in parallel: - -1. Set CONFIG_DEBUG_ENTRY=y -2. Run several copies of all of the tools/testing/selftests/x86/ tests - (excluding MPX and protection_keys) in a loop on multiple CPUs for - several minutes. These tests frequently uncover corner cases in the - kernel entry code. In general, old kernels might cause these tests - themselves to crash, but they should never crash the kernel. -3. Run the 'perf' tool in a mode (top or record) that generates many - frequent performance monitoring non-maskable interrupts (see "NMI" - in /proc/interrupts). This exercises the NMI entry/exit code which - is known to trigger bugs in code paths that did not expect to be - interrupted, including nested NMIs. Using "-c" boosts the rate of - NMIs, and using two -c with separate counters encourages nested NMIs - and less deterministic behavior. - - while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done - -4. Launch a KVM virtual machine. -5. Run 32-bit binaries on systems supporting the SYSCALL instruction. - This has been a lightly-tested code path and needs extra scrutiny. - -Debugging -========= - -Bugs in PTI cause a few different signatures of crashes -that are worth noting here. - - * Failures of the selftests/x86 code. Usually a bug in one of the - more obscure corners of entry_64.S - * Crashes in early boot, especially around CPU bringup. Bugs - in the trampoline code or mappings cause these. - * Crashes at the first interrupt. Caused by bugs in entry_64.S, - like screwing up a page table switch. Also caused by - incorrectly mapping the IRQ handler entry code. - * Crashes at the first NMI. The NMI code is separate from main - interrupt handlers and can have bugs that do not affect - normal interrupts. Also caused by incorrectly mapping NMI - code. NMIs that interrupt the entry code must be very - careful and can be the cause of crashes that show up when - running perf. - * Kernel crashes at the first exit to userspace. entry_64.S - bugs, or failing to map some of the exit code. - * Crashes at first interrupt that interrupts userspace. The paths - in entry_64.S that return to userspace are sometimes separate - from the ones that return to the kernel. - * Double faults: overflowing the kernel stack because of page - faults upon page faults. Caused by touching non-pti-mapped - data in the entry code, or forgetting to switch to kernel - CR3 before calling into C functions which are not pti-mapped. - * Userspace segfaults early in boot, sometimes manifesting - as mount(8) failing to mount the rootfs. These have - tended to be TLB invalidation issues. Usually invalidating - the wrong PCID, or otherwise missing an invalidation. - -1. https://gruss.cc/files/kaiser.pdf -2. https://meltdownattack.com/meltdown.pdf -- cgit v1.2.3 From 3d07bc393f9b63ca4c6f9953922f9122a11f29c3 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:30 +0800 Subject: Documentation: x86: convert microcode.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/index.rst | 1 + Documentation/x86/microcode.rst | 142 ++++++++++++++++++++++++++++++++++++++++ Documentation/x86/microcode.txt | 136 -------------------------------------- 3 files changed, 143 insertions(+), 136 deletions(-) create mode 100644 Documentation/x86/microcode.rst delete mode 100644 Documentation/x86/microcode.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index 6719defc16f8..ae29c026be72 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -22,3 +22,4 @@ x86-specific Documentation intel_mpx amd-memory-encryption pti + microcode diff --git a/Documentation/x86/microcode.rst b/Documentation/x86/microcode.rst new file mode 100644 index 000000000000..a320d37982ed --- /dev/null +++ b/Documentation/x86/microcode.rst @@ -0,0 +1,142 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== +The Linux Microcode Loader +========================== + +:Authors: - Fenghua Yu + - Borislav Petkov + +The kernel has a x86 microcode loading facility which is supposed to +provide microcode loading methods in the OS. Potential use cases are +updating the microcode on platforms beyond the OEM End-Of-Life support, +and updating the microcode on long-running systems without rebooting. + +The loader supports three loading methods: + +Early load microcode +==================== + +The kernel can update microcode very early during boot. Loading +microcode early can fix CPU issues before they are observed during +kernel boot time. + +The microcode is stored in an initrd file. During boot, it is read from +it and loaded into the CPU cores. + +The format of the combined initrd image is microcode in (uncompressed) +cpio format followed by the (possibly compressed) initrd image. The +loader parses the combined initrd image during boot. + +The microcode files in cpio name space are: + +on Intel: + kernel/x86/microcode/GenuineIntel.bin +on AMD : + kernel/x86/microcode/AuthenticAMD.bin + +During BSP (BootStrapping Processor) boot (pre-SMP), the kernel +scans the microcode file in the initrd. If microcode matching the +CPU is found, it will be applied in the BSP and later on in all APs +(Application Processors). + +The loader also saves the matching microcode for the CPU in memory. +Thus, the cached microcode patch is applied when CPUs resume from a +sleep state. + +Here's a crude example how to prepare an initrd with microcode (this is +normally done automatically by the distribution, when recreating the +initrd, so you don't really have to do it yourself. It is documented +here for future reference only). +:: + + #!/bin/bash + + if [ -z "$1" ]; then + echo "You need to supply an initrd file" + exit 1 + fi + + INITRD="$1" + + DSTDIR=kernel/x86/microcode + TMPDIR=/tmp/initrd + + rm -rf $TMPDIR + + mkdir $TMPDIR + cd $TMPDIR + mkdir -p $DSTDIR + + if [ -d /lib/firmware/amd-ucode ]; then + cat /lib/firmware/amd-ucode/microcode_amd*.bin > $DSTDIR/AuthenticAMD.bin + fi + + if [ -d /lib/firmware/intel-ucode ]; then + cat /lib/firmware/intel-ucode/* > $DSTDIR/GenuineIntel.bin + fi + + find . | cpio -o -H newc >../ucode.cpio + cd .. + mv $INITRD $INITRD.orig + cat ucode.cpio $INITRD.orig > $INITRD + + rm -rf $TMPDIR + + +The system needs to have the microcode packages installed into +/lib/firmware or you need to fixup the paths above if yours are +somewhere else and/or you've downloaded them directly from the processor +vendor's site. + +Late loading +============ + +There are two legacy user space interfaces to load microcode, either through +/dev/cpu/microcode or through /sys/devices/system/cpu/microcode/reload file +in sysfs. + +The /dev/cpu/microcode method is deprecated because it needs a special +userspace tool for that. + +The easier method is simply installing the microcode packages your distro +supplies and running:: + + # echo 1 > /sys/devices/system/cpu/microcode/reload + +as root. + +The loading mechanism looks for microcode blobs in +/lib/firmware/{intel-ucode,amd-ucode}. The default distro installation +packages already put them there. + +Builtin microcode +================= + +The loader supports also loading of a builtin microcode supplied through +the regular builtin firmware method CONFIG_EXTRA_FIRMWARE. Only 64-bit is +currently supported. + +Here's an example:: + + CONFIG_EXTRA_FIRMWARE="intel-ucode/06-3a-09 amd-ucode/microcode_amd_fam15h.bin" + CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware" + +This basically means, you have the following tree structure locally:: + + /lib/firmware/ + |-- amd-ucode + ... + | |-- microcode_amd_fam15h.bin + ... + |-- intel-ucode + ... + | |-- 06-3a-09 + ... + +so that the build system can find those files and integrate them into +the final kernel image. The early loader finds them and applies them. + +Needless to say, this method is not the most flexible one because it +requires rebuilding the kernel each time updated microcode from the CPU +vendor is available. diff --git a/Documentation/x86/microcode.txt b/Documentation/x86/microcode.txt deleted file mode 100644 index 79fdb4a8148a..000000000000 --- a/Documentation/x86/microcode.txt +++ /dev/null @@ -1,136 +0,0 @@ - The Linux Microcode Loader - -Authors: Fenghua Yu - Borislav Petkov - -The kernel has a x86 microcode loading facility which is supposed to -provide microcode loading methods in the OS. Potential use cases are -updating the microcode on platforms beyond the OEM End-Of-Life support, -and updating the microcode on long-running systems without rebooting. - -The loader supports three loading methods: - -1. Early load microcode -======================= - -The kernel can update microcode very early during boot. Loading -microcode early can fix CPU issues before they are observed during -kernel boot time. - -The microcode is stored in an initrd file. During boot, it is read from -it and loaded into the CPU cores. - -The format of the combined initrd image is microcode in (uncompressed) -cpio format followed by the (possibly compressed) initrd image. The -loader parses the combined initrd image during boot. - -The microcode files in cpio name space are: - -on Intel: kernel/x86/microcode/GenuineIntel.bin -on AMD : kernel/x86/microcode/AuthenticAMD.bin - -During BSP (BootStrapping Processor) boot (pre-SMP), the kernel -scans the microcode file in the initrd. If microcode matching the -CPU is found, it will be applied in the BSP and later on in all APs -(Application Processors). - -The loader also saves the matching microcode for the CPU in memory. -Thus, the cached microcode patch is applied when CPUs resume from a -sleep state. - -Here's a crude example how to prepare an initrd with microcode (this is -normally done automatically by the distribution, when recreating the -initrd, so you don't really have to do it yourself. It is documented -here for future reference only). - ---- - #!/bin/bash - - if [ -z "$1" ]; then - echo "You need to supply an initrd file" - exit 1 - fi - - INITRD="$1" - - DSTDIR=kernel/x86/microcode - TMPDIR=/tmp/initrd - - rm -rf $TMPDIR - - mkdir $TMPDIR - cd $TMPDIR - mkdir -p $DSTDIR - - if [ -d /lib/firmware/amd-ucode ]; then - cat /lib/firmware/amd-ucode/microcode_amd*.bin > $DSTDIR/AuthenticAMD.bin - fi - - if [ -d /lib/firmware/intel-ucode ]; then - cat /lib/firmware/intel-ucode/* > $DSTDIR/GenuineIntel.bin - fi - - find . | cpio -o -H newc >../ucode.cpio - cd .. - mv $INITRD $INITRD.orig - cat ucode.cpio $INITRD.orig > $INITRD - - rm -rf $TMPDIR ---- - -The system needs to have the microcode packages installed into -/lib/firmware or you need to fixup the paths above if yours are -somewhere else and/or you've downloaded them directly from the processor -vendor's site. - -2. Late loading -=============== - -There are two legacy user space interfaces to load microcode, either through -/dev/cpu/microcode or through /sys/devices/system/cpu/microcode/reload file -in sysfs. - -The /dev/cpu/microcode method is deprecated because it needs a special -userspace tool for that. - -The easier method is simply installing the microcode packages your distro -supplies and running: - -# echo 1 > /sys/devices/system/cpu/microcode/reload - -as root. - -The loading mechanism looks for microcode blobs in -/lib/firmware/{intel-ucode,amd-ucode}. The default distro installation -packages already put them there. - -3. Builtin microcode -==================== - -The loader supports also loading of a builtin microcode supplied through -the regular builtin firmware method CONFIG_EXTRA_FIRMWARE. Only 64-bit is -currently supported. - -Here's an example: - -CONFIG_EXTRA_FIRMWARE="intel-ucode/06-3a-09 amd-ucode/microcode_amd_fam15h.bin" -CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware" - -This basically means, you have the following tree structure locally: - -/lib/firmware/ -|-- amd-ucode -... -| |-- microcode_amd_fam15h.bin -... -|-- intel-ucode -... -| |-- 06-3a-09 -... - -so that the build system can find those files and integrate them into -the final kernel image. The early loader finds them and applies them. - -Needless to say, this method is not the most flexible one because it -requires rebuilding the kernel each time updated microcode from the CPU -vendor is available. -- cgit v1.2.3 From 1cd7af509dc223905dce622c07ec62e26044e3c0 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:31 +0800 Subject: Documentation: x86: convert resctrl_ui.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/index.rst | 1 + Documentation/x86/resctrl_ui.rst | 1191 ++++++++++++++++++++++++++++++++++++++ Documentation/x86/resctrl_ui.txt | 1121 ----------------------------------- 3 files changed, 1192 insertions(+), 1121 deletions(-) create mode 100644 Documentation/x86/resctrl_ui.rst delete mode 100644 Documentation/x86/resctrl_ui.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index ae29c026be72..6e3c887a0c3b 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -23,3 +23,4 @@ x86-specific Documentation amd-memory-encryption pti microcode + resctrl_ui diff --git a/Documentation/x86/resctrl_ui.rst b/Documentation/x86/resctrl_ui.rst new file mode 100644 index 000000000000..225cfd4daaee --- /dev/null +++ b/Documentation/x86/resctrl_ui.rst @@ -0,0 +1,1191 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +=========================================== +User Interface for Resource Control feature +=========================================== + +:Copyright: |copy| 2016 Intel Corporation +:Authors: - Fenghua Yu + - Tony Luck + - Vikas Shivappa + + +Intel refers to this feature as Intel Resource Director Technology(Intel(R) RDT). +AMD refers to this feature as AMD Platform Quality of Service(AMD QoS). + +This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo +flag bits: + +============================================= ================================ +RDT (Resource Director Technology) Allocation "rdt_a" +CAT (Cache Allocation Technology) "cat_l3", "cat_l2" +CDP (Code and Data Prioritization) "cdp_l3", "cdp_l2" +CQM (Cache QoS Monitoring) "cqm_llc", "cqm_occup_llc" +MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local" +MBA (Memory Bandwidth Allocation) "mba" +============================================= ================================ + +To use the feature mount the file system:: + + # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl + +mount options are: + +"cdp": + Enable code/data prioritization in L3 cache allocations. +"cdpl2": + Enable code/data prioritization in L2 cache allocations. +"mba_MBps": + Enable the MBA Software Controller(mba_sc) to specify MBA + bandwidth in MBps + +L2 and L3 CDP are controlled seperately. + +RDT features are orthogonal. A particular system may support only +monitoring, only control, or both monitoring and control. Cache +pseudo-locking is a unique way of using cache control to "pin" or +"lock" data in the cache. Details can be found in +"Cache Pseudo-Locking". + + +The mount succeeds if either of allocation or monitoring is present, but +only those files and directories supported by the system will be created. +For more details on the behavior of the interface during monitoring +and allocation, see the "Resource alloc and monitor groups" section. + +Info directory +============== + +The 'info' directory contains information about the enabled +resources. Each resource has its own subdirectory. The subdirectory +names reflect the resource names. + +Each subdirectory contains the following files with respect to +allocation: + +Cache resource(L3/L2) subdirectory contains the following files +related to allocation: + +"num_closids": + The number of CLOSIDs which are valid for this + resource. The kernel uses the smallest number of + CLOSIDs of all enabled resources as limit. +"cbm_mask": + The bitmask which is valid for this resource. + This mask is equivalent to 100%. +"min_cbm_bits": + The minimum number of consecutive bits which + must be set when writing a mask. + +"shareable_bits": + Bitmask of shareable resource with other executing + entities (e.g. I/O). User can use this when + setting up exclusive cache partitions. Note that + some platforms support devices that have their + own settings for cache use which can over-ride + these bits. +"bit_usage": + Annotated capacity bitmasks showing how all + instances of the resource are used. The legend is: + + "0": + Corresponding region is unused. When the system's + resources have been allocated and a "0" is found + in "bit_usage" it is a sign that resources are + wasted. + + "H": + Corresponding region is used by hardware only + but available for software use. If a resource + has bits set in "shareable_bits" but not all + of these bits appear in the resource groups' + schematas then the bits appearing in + "shareable_bits" but no resource group will + be marked as "H". + "X": + Corresponding region is available for sharing and + used by hardware and software. These are the + bits that appear in "shareable_bits" as + well as a resource group's allocation. + "S": + Corresponding region is used by software + and available for sharing. + "E": + Corresponding region is used exclusively by + one resource group. No sharing allowed. + "P": + Corresponding region is pseudo-locked. No + sharing allowed. + +Memory bandwitdh(MB) subdirectory contains the following files +with respect to allocation: + +"min_bandwidth": + The minimum memory bandwidth percentage which + user can request. + +"bandwidth_gran": + The granularity in which the memory bandwidth + percentage is allocated. The allocated + b/w percentage is rounded off to the next + control step available on the hardware. The + available bandwidth control steps are: + min_bandwidth + N * bandwidth_gran. + +"delay_linear": + Indicates if the delay scale is linear or + non-linear. This field is purely informational + only. + +If RDT monitoring is available there will be an "L3_MON" directory +with the following files: + +"num_rmids": + The number of RMIDs available. This is the + upper bound for how many "CTRL_MON" + "MON" + groups can be created. + +"mon_features": + Lists the monitoring events if + monitoring is enabled for the resource. + +"max_threshold_occupancy": + Read/write file provides the largest value (in + bytes) at which a previously used LLC_occupancy + counter can be considered for re-use. + +Finally, in the top level of the "info" directory there is a file +named "last_cmd_status". This is reset with every "command" issued +via the file system (making new directories or writing to any of the +control files). If the command was successful, it will read as "ok". +If the command failed, it will provide more information that can be +conveyed in the error returns from file operations. E.g. +:: + + # echo L3:0=f7 > schemata + bash: echo: write error: Invalid argument + # cat info/last_cmd_status + mask f7 has non-consecutive 1-bits + +Resource alloc and monitor groups +================================= + +Resource groups are represented as directories in the resctrl file +system. The default group is the root directory which, immediately +after mounting, owns all the tasks and cpus in the system and can make +full use of all resources. + +On a system with RDT control features additional directories can be +created in the root directory that specify different amounts of each +resource (see "schemata" below). The root and these additional top level +directories are referred to as "CTRL_MON" groups below. + +On a system with RDT monitoring the root directory and other top level +directories contain a directory named "mon_groups" in which additional +directories can be created to monitor subsets of tasks in the CTRL_MON +group that is their ancestor. These are called "MON" groups in the rest +of this document. + +Removing a directory will move all tasks and cpus owned by the group it +represents to the parent. Removing one of the created CTRL_MON groups +will automatically remove all MON groups below it. + +All groups contain the following files: + +"tasks": + Reading this file shows the list of all tasks that belong to + this group. Writing a task id to the file will add a task to the + group. If the group is a CTRL_MON group the task is removed from + whichever previous CTRL_MON group owned the task and also from + any MON group that owned the task. If the group is a MON group, + then the task must already belong to the CTRL_MON parent of this + group. The task is removed from any previous MON group. + + +"cpus": + Reading this file shows a bitmask of the logical CPUs owned by + this group. Writing a mask to this file will add and remove + CPUs to/from this group. As with the tasks file a hierarchy is + maintained where MON groups may only include CPUs owned by the + parent CTRL_MON group. + When the resouce group is in pseudo-locked mode this file will + only be readable, reflecting the CPUs associated with the + pseudo-locked region. + + +"cpus_list": + Just like "cpus", only using ranges of CPUs instead of bitmasks. + + +When control is enabled all CTRL_MON groups will also contain: + +"schemata": + A list of all the resources available to this group. + Each resource has its own line and format - see below for details. + +"size": + Mirrors the display of the "schemata" file to display the size in + bytes of each allocation instead of the bits representing the + allocation. + +"mode": + The "mode" of the resource group dictates the sharing of its + allocations. A "shareable" resource group allows sharing of its + allocations while an "exclusive" resource group does not. A + cache pseudo-locked region is created by first writing + "pseudo-locksetup" to the "mode" file before writing the cache + pseudo-locked region's schemata to the resource group's "schemata" + file. On successful pseudo-locked region creation the mode will + automatically change to "pseudo-locked". + +When monitoring is enabled all MON groups will also contain: + +"mon_data": + This contains a set of files organized by L3 domain and by + RDT event. E.g. on a system with two L3 domains there will + be subdirectories "mon_L3_00" and "mon_L3_01". Each of these + directories have one file per event (e.g. "llc_occupancy", + "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these + files provide a read out of the current value of the event for + all tasks in the group. In CTRL_MON groups these files provide + the sum for all tasks in the CTRL_MON group and all tasks in + MON groups. Please see example section for more details on usage. + +Resource allocation rules +------------------------- + +When a task is running the following rules define which resources are +available to it: + +1) If the task is a member of a non-default group, then the schemata + for that group is used. + +2) Else if the task belongs to the default group, but is running on a + CPU that is assigned to some specific group, then the schemata for the + CPU's group is used. + +3) Otherwise the schemata for the default group is used. + +Resource monitoring rules +------------------------- +1) If a task is a member of a MON group, or non-default CTRL_MON group + then RDT events for the task will be reported in that group. + +2) If a task is a member of the default CTRL_MON group, but is running + on a CPU that is assigned to some specific group, then the RDT events + for the task will be reported in that group. + +3) Otherwise RDT events for the task will be reported in the root level + "mon_data" group. + + +Notes on cache occupancy monitoring and control +=============================================== +When moving a task from one group to another you should remember that +this only affects *new* cache allocations by the task. E.g. you may have +a task in a monitor group showing 3 MB of cache occupancy. If you move +to a new group and immediately check the occupancy of the old and new +groups you will likely see that the old group is still showing 3 MB and +the new group zero. When the task accesses locations still in cache from +before the move, the h/w does not update any counters. On a busy system +you will likely see the occupancy in the old group go down as cache lines +are evicted and re-used while the occupancy in the new group rises as +the task accesses memory and loads into the cache are counted based on +membership in the new group. + +The same applies to cache allocation control. Moving a task to a group +with a smaller cache partition will not evict any cache lines. The +process may continue to use them from the old partition. + +Hardware uses CLOSid(Class of service ID) and an RMID(Resource monitoring ID) +to identify a control group and a monitoring group respectively. Each of +the resource groups are mapped to these IDs based on the kind of group. The +number of CLOSid and RMID are limited by the hardware and hence the creation of +a "CTRL_MON" directory may fail if we run out of either CLOSID or RMID +and creation of "MON" group may fail if we run out of RMIDs. + +max_threshold_occupancy - generic concepts +------------------------------------------ + +Note that an RMID once freed may not be immediately available for use as +the RMID is still tagged the cache lines of the previous user of RMID. +Hence such RMIDs are placed on limbo list and checked back if the cache +occupancy has gone down. If there is a time when system has a lot of +limbo RMIDs but which are not ready to be used, user may see an -EBUSY +during mkdir. + +max_threshold_occupancy is a user configurable value to determine the +occupancy at which an RMID can be freed. + +Schemata files - general concepts +--------------------------------- +Each line in the file describes one resource. The line starts with +the name of the resource, followed by specific values to be applied +in each of the instances of that resource on the system. + +Cache IDs +--------- +On current generation systems there is one L3 cache per socket and L2 +caches are generally just shared by the hyperthreads on a core, but this +isn't an architectural requirement. We could have multiple separate L3 +caches on a socket, multiple cores could share an L2 cache. So instead +of using "socket" or "core" to define the set of logical cpus sharing +a resource we use a "Cache ID". At a given cache level this will be a +unique number across the whole system (but it isn't guaranteed to be a +contiguous sequence, there may be gaps). To find the ID for each logical +CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id + +Cache Bit Masks (CBM) +--------------------- +For cache resources we describe the portion of the cache that is available +for allocation using a bitmask. The maximum value of the mask is defined +by each cpu model (and may be different for different cache levels). It +is found using CPUID, but is also provided in the "info" directory of +the resctrl file system in "info/{resource}/cbm_mask". X86 hardware +requires that these masks have all the '1' bits in a contiguous block. So +0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9 +and 0xA are not. On a system with a 20-bit mask each bit represents 5% +of the capacity of the cache. You could partition the cache into four +equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000. + +Memory bandwidth Allocation and monitoring +========================================== + +For Memory bandwidth resource, by default the user controls the resource +by indicating the percentage of total memory bandwidth. + +The minimum bandwidth percentage value for each cpu model is predefined +and can be looked up through "info/MB/min_bandwidth". The bandwidth +granularity that is allocated is also dependent on the cpu model and can +be looked up at "info/MB/bandwidth_gran". The available bandwidth +control steps are: min_bw + N * bw_gran. Intermediate values are rounded +to the next control step available on the hardware. + +The bandwidth throttling is a core specific mechanism on some of Intel +SKUs. Using a high bandwidth and a low bandwidth setting on two threads +sharing a core will result in both threads being throttled to use the +low bandwidth. The fact that Memory bandwidth allocation(MBA) is a core +specific mechanism where as memory bandwidth monitoring(MBM) is done at +the package level may lead to confusion when users try to apply control +via the MBA and then monitor the bandwidth to see if the controls are +effective. Below are such scenarios: + +1. User may *not* see increase in actual bandwidth when percentage + values are increased: + +This can occur when aggregate L2 external bandwidth is more than L3 +external bandwidth. Consider an SKL SKU with 24 cores on a package and +where L2 external is 10GBps (hence aggregate L2 external bandwidth is +240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20 +threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3 +bandwidth of 100GBps although the percentage value specified is only 50% +<< 100%. Hence increasing the bandwidth percentage will not yeild any +more bandwidth. This is because although the L2 external bandwidth still +has capacity, the L3 external bandwidth is fully used. Also note that +this would be dependent on number of cores the benchmark is run on. + +2. Same bandwidth percentage may mean different actual bandwidth + depending on # of threads: + +For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4 +thread, with 10% bandwidth' can consume upto 10GBps and 40GBps although +they have same percentage bandwidth of 10%. This is simply because as +threads start using more cores in an rdtgroup, the actual bandwidth may +increase or vary although user specified bandwidth percentage is same. + +In order to mitigate this and make the interface more user friendly, +resctrl added support for specifying the bandwidth in MBps as well. The +kernel underneath would use a software feedback mechanism or a "Software +Controller(mba_sc)" which reads the actual bandwidth using MBM counters +and adjust the memowy bandwidth percentages to ensure:: + + "actual bandwidth < user specified bandwidth". + +By default, the schemata would take the bandwidth percentage values +where as user can switch to the "MBA software controller" mode using +a mount option 'mba_MBps'. The schemata format is specified in the below +sections. + +L3 schemata file details (code and data prioritization disabled) +---------------------------------------------------------------- +With CDP disabled the L3 schemata format is:: + + L3:=;=;... + +L3 schemata file details (CDP enabled via mount option to resctrl) +------------------------------------------------------------------ +When CDP is enabled L3 control is split into two separate resources +so you can specify independent masks for code and data like this:: + + L3data:=;=;... + L3code:=;=;... + +L2 schemata file details +------------------------ +L2 cache does not support code and data prioritization, so the +schemata format is always:: + + L2:=;=;... + +Memory bandwidth Allocation (default mode) +------------------------------------------ + +Memory b/w domain is L3 cache. +:: + + MB:=bandwidth0;=bandwidth1;... + +Memory bandwidth Allocation specified in MBps +--------------------------------------------- + +Memory bandwidth domain is L3 cache. +:: + + MB:=bw_MBps0;=bw_MBps1;... + +Reading/writing the schemata file +--------------------------------- +Reading the schemata file will show the state of all resources +on all domains. When writing you only need to specify those values +which you wish to change. E.g. +:: + + # cat schemata + L3DATA:0=fffff;1=fffff;2=fffff;3=fffff + L3CODE:0=fffff;1=fffff;2=fffff;3=fffff + # echo "L3DATA:2=3c0;" > schemata + # cat schemata + L3DATA:0=fffff;1=fffff;2=3c0;3=fffff + L3CODE:0=fffff;1=fffff;2=fffff;3=fffff + +Cache Pseudo-Locking +==================== +CAT enables a user to specify the amount of cache space that an +application can fill. Cache pseudo-locking builds on the fact that a +CPU can still read and write data pre-allocated outside its current +allocated area on a cache hit. With cache pseudo-locking, data can be +preloaded into a reserved portion of cache that no application can +fill, and from that point on will only serve cache hits. The cache +pseudo-locked memory is made accessible to user space where an +application can map it into its virtual address space and thus have +a region of memory with reduced average read latency. + +The creation of a cache pseudo-locked region is triggered by a request +from the user to do so that is accompanied by a schemata of the region +to be pseudo-locked. The cache pseudo-locked region is created as follows: + +- Create a CAT allocation CLOSNEW with a CBM matching the schemata + from the user of the cache region that will contain the pseudo-locked + memory. This region must not overlap with any current CAT allocation/CLOS + on the system and no future overlap with this cache region is allowed + while the pseudo-locked region exists. +- Create a contiguous region of memory of the same size as the cache + region. +- Flush the cache, disable hardware prefetchers, disable preemption. +- Make CLOSNEW the active CLOS and touch the allocated memory to load + it into the cache. +- Set the previous CLOS as active. +- At this point the closid CLOSNEW can be released - the cache + pseudo-locked region is protected as long as its CBM does not appear in + any CAT allocation. Even though the cache pseudo-locked region will from + this point on not appear in any CBM of any CLOS an application running with + any CLOS will be able to access the memory in the pseudo-locked region since + the region continues to serve cache hits. +- The contiguous region of memory loaded into the cache is exposed to + user-space as a character device. + +Cache pseudo-locking increases the probability that data will remain +in the cache via carefully configuring the CAT feature and controlling +application behavior. There is no guarantee that data is placed in +cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict +“locked” data from cache. Power management C-states may shrink or +power off cache. Deeper C-states will automatically be restricted on +pseudo-locked region creation. + +It is required that an application using a pseudo-locked region runs +with affinity to the cores (or a subset of the cores) associated +with the cache on which the pseudo-locked region resides. A sanity check +within the code will not allow an application to map pseudo-locked memory +unless it runs with affinity to cores associated with the cache on which the +pseudo-locked region resides. The sanity check is only done during the +initial mmap() handling, there is no enforcement afterwards and the +application self needs to ensure it remains affine to the correct cores. + +Pseudo-locking is accomplished in two stages: + +1) During the first stage the system administrator allocates a portion + of cache that should be dedicated to pseudo-locking. At this time an + equivalent portion of memory is allocated, loaded into allocated + cache portion, and exposed as a character device. +2) During the second stage a user-space application maps (mmap()) the + pseudo-locked memory into its address space. + +Cache Pseudo-Locking Interface +------------------------------ +A pseudo-locked region is created using the resctrl interface as follows: + +1) Create a new resource group by creating a new directory in /sys/fs/resctrl. +2) Change the new resource group's mode to "pseudo-locksetup" by writing + "pseudo-locksetup" to the "mode" file. +3) Write the schemata of the pseudo-locked region to the "schemata" file. All + bits within the schemata should be "unused" according to the "bit_usage" + file. + +On successful pseudo-locked region creation the "mode" file will contain +"pseudo-locked" and a new character device with the same name as the resource +group will exist in /dev/pseudo_lock. This character device can be mmap()'ed +by user space in order to obtain access to the pseudo-locked memory region. + +An example of cache pseudo-locked region creation and usage can be found below. + +Cache Pseudo-Locking Debugging Interface +---------------------------------------- +The pseudo-locking debugging interface is enabled by default (if +CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl. + +There is no explicit way for the kernel to test if a provided memory +location is present in the cache. The pseudo-locking debugging interface uses +the tracing infrastructure to provide two ways to measure cache residency of +the pseudo-locked region: + +1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data + from these measurements are best visualized using a hist trigger (see + example below). In this test the pseudo-locked region is traversed at + a stride of 32 bytes while hardware prefetchers and preemption + are disabled. This also provides a substitute visualization of cache + hits and misses. +2) Cache hit and miss measurements using model specific precision counters if + available. Depending on the levels of cache on the system the pseudo_lock_l2 + and pseudo_lock_l3 tracepoints are available. + +When a pseudo-locked region is created a new debugfs directory is created for +it in debugfs as /sys/kernel/debug/resctrl/. A single +write-only file, pseudo_lock_measure, is present in this directory. The +measurement of the pseudo-locked region depends on the number written to this +debugfs file: + +1: + writing "1" to the pseudo_lock_measure file will trigger the latency + measurement captured in the pseudo_lock_mem_latency tracepoint. See + example below. +2: + writing "2" to the pseudo_lock_measure file will trigger the L2 cache + residency (cache hits and misses) measurement captured in the + pseudo_lock_l2 tracepoint. See example below. +3: + writing "3" to the pseudo_lock_measure file will trigger the L3 cache + residency (cache hits and misses) measurement captured in the + pseudo_lock_l3 tracepoint. + +All measurements are recorded with the tracing infrastructure. This requires +the relevant tracepoints to be enabled before the measurement is triggered. + +Example of latency debugging interface +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +In this example a pseudo-locked region named "newlock" was created. Here is +how we can measure the latency in cycles of reading from this region and +visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS +is set:: + + # :> /sys/kernel/debug/tracing/trace + # echo 'hist:keys=latency' > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/trigger + # echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable + # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure + # echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable + # cat /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/hist + + # event histogram + # + # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active] + # + + { latency: 456 } hitcount: 1 + { latency: 50 } hitcount: 83 + { latency: 36 } hitcount: 96 + { latency: 44 } hitcount: 174 + { latency: 48 } hitcount: 195 + { latency: 46 } hitcount: 262 + { latency: 42 } hitcount: 693 + { latency: 40 } hitcount: 3204 + { latency: 38 } hitcount: 3484 + + Totals: + Hits: 8192 + Entries: 9 + Dropped: 0 + +Example of cache hits/misses debugging +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +In this example a pseudo-locked region named "newlock" was created on the L2 +cache of a platform. Here is how we can obtain details of the cache hits +and misses using the platform's precision counters. +:: + + # :> /sys/kernel/debug/tracing/trace + # echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable + # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure + # echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable + # cat /sys/kernel/debug/tracing/trace + + # tracer: nop + # + # _-----=> irqs-off + # / _----=> need-resched + # | / _---=> hardirq/softirq + # || / _--=> preempt-depth + # ||| / delay + # TASK-PID CPU# |||| TIMESTAMP FUNCTION + # | | | |||| | | + pseudo_lock_mea-1672 [002] .... 3132.860500: pseudo_lock_l2: hits=4097 miss=0 + + +Examples for RDT allocation usage +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +1) Example 1 + +On a two socket machine (one L3 cache per socket) with just four bits +for cache bit masks, minimum b/w of 10% with a memory bandwidth +granularity of 10%. +:: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + # mkdir p0 p1 + # echo "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata + # echo "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata + +The default resource group is unmodified, so we have access to all parts +of all caches (its schemata file reads "L3:0=f;1=f"). + +Tasks that are under the control of group "p0" may only allocate from the +"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1. +Tasks in group "p1" use the "lower" 50% of cache on both sockets. + +Similarly, tasks that are under the control of group "p0" may use a +maximum memory b/w of 50% on socket0 and 50% on socket 1. +Tasks in group "p1" may also use 50% memory b/w on both sockets. +Note that unlike cache masks, memory b/w cannot specify whether these +allocations can overlap or not. The allocations specifies the maximum +b/w that the group may be able to use and the system admin can configure +the b/w accordingly. + +If the MBA is specified in MB(megabytes) then user can enter the max b/w in MB +rather than the percentage values. +:: + + # echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata + # echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata + +In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w +of 1024MB where as on socket 1 they would use 500MB. + +2) Example 2 + +Again two sockets, but this time with a more realistic 20-bit mask. + +Two real time tasks pid=1234 running on processor 0 and pid=5678 running on +processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy +neighbors, each of the two real-time tasks exclusively occupies one quarter +of L3 cache on socket 0. +:: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + +First we reset the schemata for the default group so that the "upper" +50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by +ordinary tasks:: + + # echo "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata + +Next we make a resource group for our first real time task and give +it access to the "top" 25% of the cache on socket 0. +:: + + # mkdir p0 + # echo "L3:0=f8000;1=fffff" > p0/schemata + +Finally we move our first real time task into this resource group. We +also use taskset(1) to ensure the task always runs on a dedicated CPU +on socket 0. Most uses of resource groups will also constrain which +processors tasks run on. +:: + + # echo 1234 > p0/tasks + # taskset -cp 1 1234 + +Ditto for the second real time task (with the remaining 25% of cache):: + + # mkdir p1 + # echo "L3:0=7c00;1=fffff" > p1/schemata + # echo 5678 > p1/tasks + # taskset -cp 2 5678 + +For the same 2 socket system with memory b/w resource and CAT L3 the +schemata would look like(Assume min_bandwidth 10 and bandwidth_gran is +10): + +For our first real time task this would request 20% memory b/w on socket 0. +:: + + # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata + +For our second real time task this would request an other 20% memory b/w +on socket 0. +:: + + # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata + +3) Example 3 + +A single socket system which has real-time tasks running on core 4-7 and +non real-time workload assigned to core 0-3. The real-time tasks share text +and data, so a per task association is not required and due to interaction +with the kernel it's desired that the kernel on these cores shares L3 with +the tasks. +:: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + +First we reset the schemata for the default group so that the "upper" +50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0 +cannot be used by ordinary tasks:: + + # echo "L3:0=3ff\nMB:0=50" > schemata + +Next we make a resource group for our real time cores and give it access +to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on +socket 0. +:: + + # mkdir p0 + # echo "L3:0=ffc00\nMB:0=50" > p0/schemata + +Finally we move core 4-7 over to the new group and make sure that the +kernel and the tasks running there get 50% of the cache. They should +also get 50% of memory bandwidth assuming that the cores 4-7 are SMT +siblings and only the real time threads are scheduled on the cores 4-7. +:: + + # echo F0 > p0/cpus + +4) Example 4 + +The resource groups in previous examples were all in the default "shareable" +mode allowing sharing of their cache allocations. If one resource group +configures a cache allocation then nothing prevents another resource group +to overlap with that allocation. + +In this example a new exclusive resource group will be created on a L2 CAT +system with two L2 cache instances that can be configured with an 8-bit +capacity bitmask. The new exclusive resource group will be configured to use +25% of each cache instance. +:: + + # mount -t resctrl resctrl /sys/fs/resctrl/ + # cd /sys/fs/resctrl + +First, we observe that the default group is configured to allocate to all L2 +cache:: + + # cat schemata + L2:0=ff;1=ff + +We could attempt to create the new resource group at this point, but it will +fail because of the overlap with the schemata of the default group:: + + # mkdir p0 + # echo 'L2:0=0x3;1=0x3' > p0/schemata + # cat p0/mode + shareable + # echo exclusive > p0/mode + -sh: echo: write error: Invalid argument + # cat info/last_cmd_status + schemata overlaps + +To ensure that there is no overlap with another resource group the default +resource group's schemata has to change, making it possible for the new +resource group to become exclusive. +:: + + # echo 'L2:0=0xfc;1=0xfc' > schemata + # echo exclusive > p0/mode + # grep . p0/* + p0/cpus:0 + p0/mode:exclusive + p0/schemata:L2:0=03;1=03 + p0/size:L2:0=262144;1=262144 + +A new resource group will on creation not overlap with an exclusive resource +group:: + + # mkdir p1 + # grep . p1/* + p1/cpus:0 + p1/mode:shareable + p1/schemata:L2:0=fc;1=fc + p1/size:L2:0=786432;1=786432 + +The bit_usage will reflect how the cache is used:: + + # cat info/L2/bit_usage + 0=SSSSSSEE;1=SSSSSSEE + +A resource group cannot be forced to overlap with an exclusive resource group:: + + # echo 'L2:0=0x1;1=0x1' > p1/schemata + -sh: echo: write error: Invalid argument + # cat info/last_cmd_status + overlaps with exclusive group + +Example of Cache Pseudo-Locking +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Lock portion of L2 cache from cache id 1 using CBM 0x3. Pseudo-locked +region is exposed at /dev/pseudo_lock/newlock that can be provided to +application for argument to mmap(). +:: + + # mount -t resctrl resctrl /sys/fs/resctrl/ + # cd /sys/fs/resctrl + +Ensure that there are bits available that can be pseudo-locked, since only +unused bits can be pseudo-locked the bits to be pseudo-locked needs to be +removed from the default resource group's schemata:: + + # cat info/L2/bit_usage + 0=SSSSSSSS;1=SSSSSSSS + # echo 'L2:1=0xfc' > schemata + # cat info/L2/bit_usage + 0=SSSSSSSS;1=SSSSSS00 + +Create a new resource group that will be associated with the pseudo-locked +region, indicate that it will be used for a pseudo-locked region, and +configure the requested pseudo-locked region capacity bitmask:: + + # mkdir newlock + # echo pseudo-locksetup > newlock/mode + # echo 'L2:1=0x3' > newlock/schemata + +On success the resource group's mode will change to pseudo-locked, the +bit_usage will reflect the pseudo-locked region, and the character device +exposing the pseudo-locked region will exist:: + + # cat newlock/mode + pseudo-locked + # cat info/L2/bit_usage + 0=SSSSSSSS;1=SSSSSSPP + # ls -l /dev/pseudo_lock/newlock + crw------- 1 root root 243, 0 Apr 3 05:01 /dev/pseudo_lock/newlock + +:: + + /* + * Example code to access one page of pseudo-locked cache region + * from user space. + */ + #define _GNU_SOURCE + #include + #include + #include + #include + #include + #include + + /* + * It is required that the application runs with affinity to only + * cores associated with the pseudo-locked region. Here the cpu + * is hardcoded for convenience of example. + */ + static int cpuid = 2; + + int main(int argc, char *argv[]) + { + cpu_set_t cpuset; + long page_size; + void *mapping; + int dev_fd; + int ret; + + page_size = sysconf(_SC_PAGESIZE); + + CPU_ZERO(&cpuset); + CPU_SET(cpuid, &cpuset); + ret = sched_setaffinity(0, sizeof(cpuset), &cpuset); + if (ret < 0) { + perror("sched_setaffinity"); + exit(EXIT_FAILURE); + } + + dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR); + if (dev_fd < 0) { + perror("open"); + exit(EXIT_FAILURE); + } + + mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, + dev_fd, 0); + if (mapping == MAP_FAILED) { + perror("mmap"); + close(dev_fd); + exit(EXIT_FAILURE); + } + + /* Application interacts with pseudo-locked memory @mapping */ + + ret = munmap(mapping, page_size); + if (ret < 0) { + perror("munmap"); + close(dev_fd); + exit(EXIT_FAILURE); + } + + close(dev_fd); + exit(EXIT_SUCCESS); + } + +Locking between applications +---------------------------- + +Certain operations on the resctrl filesystem, composed of read/writes +to/from multiple files, must be atomic. + +As an example, the allocation of an exclusive reservation of L3 cache +involves: + + 1. Read the cbmmasks from each directory or the per-resource "bit_usage" + 2. Find a contiguous set of bits in the global CBM bitmask that is clear + in any of the directory cbmmasks + 3. Create a new directory + 4. Set the bits found in step 2 to the new directory "schemata" file + +If two applications attempt to allocate space concurrently then they can +end up allocating the same bits so the reservations are shared instead of +exclusive. + +To coordinate atomic operations on the resctrlfs and to avoid the problem +above, the following locking procedure is recommended: + +Locking is based on flock, which is available in libc and also as a shell +script command + +Write lock: + + A) Take flock(LOCK_EX) on /sys/fs/resctrl + B) Read/write the directory structure. + C) funlock + +Read lock: + + A) Take flock(LOCK_SH) on /sys/fs/resctrl + B) If success read the directory structure. + C) funlock + +Example with bash:: + + # Atomically read directory structure + $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl + + # Read directory contents and create new subdirectory + + $ cat create-dir.sh + find /sys/fs/resctrl/ > output.txt + mask = function-of(output.txt) + mkdir /sys/fs/resctrl/newres/ + echo mask > /sys/fs/resctrl/newres/schemata + + $ flock /sys/fs/resctrl/ ./create-dir.sh + +Example with C:: + + /* + * Example code do take advisory locks + * before accessing resctrl filesystem + */ + #include + #include + + void resctrl_take_shared_lock(int fd) + { + int ret; + + /* take shared lock on resctrl filesystem */ + ret = flock(fd, LOCK_SH); + if (ret) { + perror("flock"); + exit(-1); + } + } + + void resctrl_take_exclusive_lock(int fd) + { + int ret; + + /* release lock on resctrl filesystem */ + ret = flock(fd, LOCK_EX); + if (ret) { + perror("flock"); + exit(-1); + } + } + + void resctrl_release_lock(int fd) + { + int ret; + + /* take shared lock on resctrl filesystem */ + ret = flock(fd, LOCK_UN); + if (ret) { + perror("flock"); + exit(-1); + } + } + + void main(void) + { + int fd, ret; + + fd = open("/sys/fs/resctrl", O_DIRECTORY); + if (fd == -1) { + perror("open"); + exit(-1); + } + resctrl_take_shared_lock(fd); + /* code to read directory contents */ + resctrl_release_lock(fd); + + resctrl_take_exclusive_lock(fd); + /* code to read and write directory contents */ + resctrl_release_lock(fd); + } + +Examples for RDT Monitoring along with allocation usage +======================================================= +Reading monitored data +---------------------- +Reading an event file (for ex: mon_data/mon_L3_00/llc_occupancy) would +show the current snapshot of LLC occupancy of the corresponding MON +group or CTRL_MON group. + + +Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group) +------------------------------------------------------------------------ +On a two socket machine (one L3 cache per socket) with just four bits +for cache bit masks:: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + # mkdir p0 p1 + # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata + # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata + # echo 5678 > p1/tasks + # echo 5679 > p1/tasks + +The default resource group is unmodified, so we have access to all parts +of all caches (its schemata file reads "L3:0=f;1=f"). + +Tasks that are under the control of group "p0" may only allocate from the +"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1. +Tasks in group "p1" use the "lower" 50% of cache on both sockets. + +Create monitor groups and assign a subset of tasks to each monitor group. +:: + + # cd /sys/fs/resctrl/p1/mon_groups + # mkdir m11 m12 + # echo 5678 > m11/tasks + # echo 5679 > m12/tasks + +fetch data (data shown in bytes) +:: + + # cat m11/mon_data/mon_L3_00/llc_occupancy + 16234000 + # cat m11/mon_data/mon_L3_01/llc_occupancy + 14789000 + # cat m12/mon_data/mon_L3_00/llc_occupancy + 16789000 + +The parent ctrl_mon group shows the aggregated data. +:: + + # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy + 31234000 + +Example 2 (Monitor a task from its creation) +-------------------------------------------- +On a two socket machine (one L3 cache per socket):: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + # mkdir p0 p1 + +An RMID is allocated to the group once its created and hence the +below is monitored from its creation. +:: + + # echo $$ > /sys/fs/resctrl/p1/tasks + # + +Fetch the data:: + + # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy + 31789000 + +Example 3 (Monitor without CAT support or before creating CAT groups) +--------------------------------------------------------------------- + +Assume a system like HSW has only CQM and no CAT support. In this case +the resctrl will still mount but cannot create CTRL_MON directories. +But user can create different MON groups within the root group thereby +able to monitor all tasks including kernel threads. + +This can also be used to profile jobs cache size footprint before being +able to allocate them to different allocation groups. +:: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + # mkdir mon_groups/m01 + # mkdir mon_groups/m02 + + # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks + # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks + +Monitor the groups separately and also get per domain data. From the +below its apparent that the tasks are mostly doing work on +domain(socket) 0. +:: + + # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_00/llc_occupancy + 31234000 + # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_01/llc_occupancy + 34555 + # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_00/llc_occupancy + 31234000 + # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_01/llc_occupancy + 32789 + + +Example 4 (Monitor real time tasks) +----------------------------------- + +A single socket system which has real time tasks running on cores 4-7 +and non real time tasks on other cpus. We want to monitor the cache +occupancy of the real time threads on these cores. +:: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + # mkdir p1 + +Move the cpus 4-7 over to p1:: + + # echo f0 > p1/cpus + +View the llc occupancy snapshot:: + + # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy + 11234000 diff --git a/Documentation/x86/resctrl_ui.txt b/Documentation/x86/resctrl_ui.txt deleted file mode 100644 index c1f95b59e14d..000000000000 --- a/Documentation/x86/resctrl_ui.txt +++ /dev/null @@ -1,1121 +0,0 @@ -User Interface for Resource Control feature - -Intel refers to this feature as Intel Resource Director Technology(Intel(R) RDT). -AMD refers to this feature as AMD Platform Quality of Service(AMD QoS). - -Copyright (C) 2016 Intel Corporation - -Fenghua Yu -Tony Luck -Vikas Shivappa - -This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo -flag bits: -RDT (Resource Director Technology) Allocation - "rdt_a" -CAT (Cache Allocation Technology) - "cat_l3", "cat_l2" -CDP (Code and Data Prioritization ) - "cdp_l3", "cdp_l2" -CQM (Cache QoS Monitoring) - "cqm_llc", "cqm_occup_llc" -MBM (Memory Bandwidth Monitoring) - "cqm_mbm_total", "cqm_mbm_local" -MBA (Memory Bandwidth Allocation) - "mba" - -To use the feature mount the file system: - - # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl - -mount options are: - -"cdp": Enable code/data prioritization in L3 cache allocations. -"cdpl2": Enable code/data prioritization in L2 cache allocations. -"mba_MBps": Enable the MBA Software Controller(mba_sc) to specify MBA - bandwidth in MBps - -L2 and L3 CDP are controlled seperately. - -RDT features are orthogonal. A particular system may support only -monitoring, only control, or both monitoring and control. Cache -pseudo-locking is a unique way of using cache control to "pin" or -"lock" data in the cache. Details can be found in -"Cache Pseudo-Locking". - - -The mount succeeds if either of allocation or monitoring is present, but -only those files and directories supported by the system will be created. -For more details on the behavior of the interface during monitoring -and allocation, see the "Resource alloc and monitor groups" section. - -Info directory --------------- - -The 'info' directory contains information about the enabled -resources. Each resource has its own subdirectory. The subdirectory -names reflect the resource names. - -Each subdirectory contains the following files with respect to -allocation: - -Cache resource(L3/L2) subdirectory contains the following files -related to allocation: - -"num_closids": The number of CLOSIDs which are valid for this - resource. The kernel uses the smallest number of - CLOSIDs of all enabled resources as limit. - -"cbm_mask": The bitmask which is valid for this resource. - This mask is equivalent to 100%. - -"min_cbm_bits": The minimum number of consecutive bits which - must be set when writing a mask. - -"shareable_bits": Bitmask of shareable resource with other executing - entities (e.g. I/O). User can use this when - setting up exclusive cache partitions. Note that - some platforms support devices that have their - own settings for cache use which can over-ride - these bits. -"bit_usage": Annotated capacity bitmasks showing how all - instances of the resource are used. The legend is: - "0" - Corresponding region is unused. When the system's - resources have been allocated and a "0" is found - in "bit_usage" it is a sign that resources are - wasted. - "H" - Corresponding region is used by hardware only - but available for software use. If a resource - has bits set in "shareable_bits" but not all - of these bits appear in the resource groups' - schematas then the bits appearing in - "shareable_bits" but no resource group will - be marked as "H". - "X" - Corresponding region is available for sharing and - used by hardware and software. These are the - bits that appear in "shareable_bits" as - well as a resource group's allocation. - "S" - Corresponding region is used by software - and available for sharing. - "E" - Corresponding region is used exclusively by - one resource group. No sharing allowed. - "P" - Corresponding region is pseudo-locked. No - sharing allowed. - -Memory bandwitdh(MB) subdirectory contains the following files -with respect to allocation: - -"min_bandwidth": The minimum memory bandwidth percentage which - user can request. - -"bandwidth_gran": The granularity in which the memory bandwidth - percentage is allocated. The allocated - b/w percentage is rounded off to the next - control step available on the hardware. The - available bandwidth control steps are: - min_bandwidth + N * bandwidth_gran. - -"delay_linear": Indicates if the delay scale is linear or - non-linear. This field is purely informational - only. - -If RDT monitoring is available there will be an "L3_MON" directory -with the following files: - -"num_rmids": The number of RMIDs available. This is the - upper bound for how many "CTRL_MON" + "MON" - groups can be created. - -"mon_features": Lists the monitoring events if - monitoring is enabled for the resource. - -"max_threshold_occupancy": - Read/write file provides the largest value (in - bytes) at which a previously used LLC_occupancy - counter can be considered for re-use. - -Finally, in the top level of the "info" directory there is a file -named "last_cmd_status". This is reset with every "command" issued -via the file system (making new directories or writing to any of the -control files). If the command was successful, it will read as "ok". -If the command failed, it will provide more information that can be -conveyed in the error returns from file operations. E.g. - - # echo L3:0=f7 > schemata - bash: echo: write error: Invalid argument - # cat info/last_cmd_status - mask f7 has non-consecutive 1-bits - -Resource alloc and monitor groups ---------------------------------- - -Resource groups are represented as directories in the resctrl file -system. The default group is the root directory which, immediately -after mounting, owns all the tasks and cpus in the system and can make -full use of all resources. - -On a system with RDT control features additional directories can be -created in the root directory that specify different amounts of each -resource (see "schemata" below). The root and these additional top level -directories are referred to as "CTRL_MON" groups below. - -On a system with RDT monitoring the root directory and other top level -directories contain a directory named "mon_groups" in which additional -directories can be created to monitor subsets of tasks in the CTRL_MON -group that is their ancestor. These are called "MON" groups in the rest -of this document. - -Removing a directory will move all tasks and cpus owned by the group it -represents to the parent. Removing one of the created CTRL_MON groups -will automatically remove all MON groups below it. - -All groups contain the following files: - -"tasks": - Reading this file shows the list of all tasks that belong to - this group. Writing a task id to the file will add a task to the - group. If the group is a CTRL_MON group the task is removed from - whichever previous CTRL_MON group owned the task and also from - any MON group that owned the task. If the group is a MON group, - then the task must already belong to the CTRL_MON parent of this - group. The task is removed from any previous MON group. - - -"cpus": - Reading this file shows a bitmask of the logical CPUs owned by - this group. Writing a mask to this file will add and remove - CPUs to/from this group. As with the tasks file a hierarchy is - maintained where MON groups may only include CPUs owned by the - parent CTRL_MON group. - When the resouce group is in pseudo-locked mode this file will - only be readable, reflecting the CPUs associated with the - pseudo-locked region. - - -"cpus_list": - Just like "cpus", only using ranges of CPUs instead of bitmasks. - - -When control is enabled all CTRL_MON groups will also contain: - -"schemata": - A list of all the resources available to this group. - Each resource has its own line and format - see below for details. - -"size": - Mirrors the display of the "schemata" file to display the size in - bytes of each allocation instead of the bits representing the - allocation. - -"mode": - The "mode" of the resource group dictates the sharing of its - allocations. A "shareable" resource group allows sharing of its - allocations while an "exclusive" resource group does not. A - cache pseudo-locked region is created by first writing - "pseudo-locksetup" to the "mode" file before writing the cache - pseudo-locked region's schemata to the resource group's "schemata" - file. On successful pseudo-locked region creation the mode will - automatically change to "pseudo-locked". - -When monitoring is enabled all MON groups will also contain: - -"mon_data": - This contains a set of files organized by L3 domain and by - RDT event. E.g. on a system with two L3 domains there will - be subdirectories "mon_L3_00" and "mon_L3_01". Each of these - directories have one file per event (e.g. "llc_occupancy", - "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these - files provide a read out of the current value of the event for - all tasks in the group. In CTRL_MON groups these files provide - the sum for all tasks in the CTRL_MON group and all tasks in - MON groups. Please see example section for more details on usage. - -Resource allocation rules -------------------------- -When a task is running the following rules define which resources are -available to it: - -1) If the task is a member of a non-default group, then the schemata - for that group is used. - -2) Else if the task belongs to the default group, but is running on a - CPU that is assigned to some specific group, then the schemata for the - CPU's group is used. - -3) Otherwise the schemata for the default group is used. - -Resource monitoring rules -------------------------- -1) If a task is a member of a MON group, or non-default CTRL_MON group - then RDT events for the task will be reported in that group. - -2) If a task is a member of the default CTRL_MON group, but is running - on a CPU that is assigned to some specific group, then the RDT events - for the task will be reported in that group. - -3) Otherwise RDT events for the task will be reported in the root level - "mon_data" group. - - -Notes on cache occupancy monitoring and control ------------------------------------------------ -When moving a task from one group to another you should remember that -this only affects *new* cache allocations by the task. E.g. you may have -a task in a monitor group showing 3 MB of cache occupancy. If you move -to a new group and immediately check the occupancy of the old and new -groups you will likely see that the old group is still showing 3 MB and -the new group zero. When the task accesses locations still in cache from -before the move, the h/w does not update any counters. On a busy system -you will likely see the occupancy in the old group go down as cache lines -are evicted and re-used while the occupancy in the new group rises as -the task accesses memory and loads into the cache are counted based on -membership in the new group. - -The same applies to cache allocation control. Moving a task to a group -with a smaller cache partition will not evict any cache lines. The -process may continue to use them from the old partition. - -Hardware uses CLOSid(Class of service ID) and an RMID(Resource monitoring ID) -to identify a control group and a monitoring group respectively. Each of -the resource groups are mapped to these IDs based on the kind of group. The -number of CLOSid and RMID are limited by the hardware and hence the creation of -a "CTRL_MON" directory may fail if we run out of either CLOSID or RMID -and creation of "MON" group may fail if we run out of RMIDs. - -max_threshold_occupancy - generic concepts ------------------------------------------- - -Note that an RMID once freed may not be immediately available for use as -the RMID is still tagged the cache lines of the previous user of RMID. -Hence such RMIDs are placed on limbo list and checked back if the cache -occupancy has gone down. If there is a time when system has a lot of -limbo RMIDs but which are not ready to be used, user may see an -EBUSY -during mkdir. - -max_threshold_occupancy is a user configurable value to determine the -occupancy at which an RMID can be freed. - -Schemata files - general concepts ---------------------------------- -Each line in the file describes one resource. The line starts with -the name of the resource, followed by specific values to be applied -in each of the instances of that resource on the system. - -Cache IDs ---------- -On current generation systems there is one L3 cache per socket and L2 -caches are generally just shared by the hyperthreads on a core, but this -isn't an architectural requirement. We could have multiple separate L3 -caches on a socket, multiple cores could share an L2 cache. So instead -of using "socket" or "core" to define the set of logical cpus sharing -a resource we use a "Cache ID". At a given cache level this will be a -unique number across the whole system (but it isn't guaranteed to be a -contiguous sequence, there may be gaps). To find the ID for each logical -CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id - -Cache Bit Masks (CBM) ---------------------- -For cache resources we describe the portion of the cache that is available -for allocation using a bitmask. The maximum value of the mask is defined -by each cpu model (and may be different for different cache levels). It -is found using CPUID, but is also provided in the "info" directory of -the resctrl file system in "info/{resource}/cbm_mask". X86 hardware -requires that these masks have all the '1' bits in a contiguous block. So -0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9 -and 0xA are not. On a system with a 20-bit mask each bit represents 5% -of the capacity of the cache. You could partition the cache into four -equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000. - -Memory bandwidth Allocation and monitoring ------------------------------------------- - -For Memory bandwidth resource, by default the user controls the resource -by indicating the percentage of total memory bandwidth. - -The minimum bandwidth percentage value for each cpu model is predefined -and can be looked up through "info/MB/min_bandwidth". The bandwidth -granularity that is allocated is also dependent on the cpu model and can -be looked up at "info/MB/bandwidth_gran". The available bandwidth -control steps are: min_bw + N * bw_gran. Intermediate values are rounded -to the next control step available on the hardware. - -The bandwidth throttling is a core specific mechanism on some of Intel -SKUs. Using a high bandwidth and a low bandwidth setting on two threads -sharing a core will result in both threads being throttled to use the -low bandwidth. The fact that Memory bandwidth allocation(MBA) is a core -specific mechanism where as memory bandwidth monitoring(MBM) is done at -the package level may lead to confusion when users try to apply control -via the MBA and then monitor the bandwidth to see if the controls are -effective. Below are such scenarios: - -1. User may *not* see increase in actual bandwidth when percentage - values are increased: - -This can occur when aggregate L2 external bandwidth is more than L3 -external bandwidth. Consider an SKL SKU with 24 cores on a package and -where L2 external is 10GBps (hence aggregate L2 external bandwidth is -240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20 -threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3 -bandwidth of 100GBps although the percentage value specified is only 50% -<< 100%. Hence increasing the bandwidth percentage will not yeild any -more bandwidth. This is because although the L2 external bandwidth still -has capacity, the L3 external bandwidth is fully used. Also note that -this would be dependent on number of cores the benchmark is run on. - -2. Same bandwidth percentage may mean different actual bandwidth - depending on # of threads: - -For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4 -thread, with 10% bandwidth' can consume upto 10GBps and 40GBps although -they have same percentage bandwidth of 10%. This is simply because as -threads start using more cores in an rdtgroup, the actual bandwidth may -increase or vary although user specified bandwidth percentage is same. - -In order to mitigate this and make the interface more user friendly, -resctrl added support for specifying the bandwidth in MBps as well. The -kernel underneath would use a software feedback mechanism or a "Software -Controller(mba_sc)" which reads the actual bandwidth using MBM counters -and adjust the memowy bandwidth percentages to ensure - - "actual bandwidth < user specified bandwidth". - -By default, the schemata would take the bandwidth percentage values -where as user can switch to the "MBA software controller" mode using -a mount option 'mba_MBps'. The schemata format is specified in the below -sections. - -L3 schemata file details (code and data prioritization disabled) ----------------------------------------------------------------- -With CDP disabled the L3 schemata format is: - - L3:=;=;... - -L3 schemata file details (CDP enabled via mount option to resctrl) ------------------------------------------------------------------- -When CDP is enabled L3 control is split into two separate resources -so you can specify independent masks for code and data like this: - - L3data:=;=;... - L3code:=;=;... - -L2 schemata file details ------------------------- -L2 cache does not support code and data prioritization, so the -schemata format is always: - - L2:=;=;... - -Memory bandwidth Allocation (default mode) ------------------------------------------- - -Memory b/w domain is L3 cache. - - MB:=bandwidth0;=bandwidth1;... - -Memory bandwidth Allocation specified in MBps ---------------------------------------------- - -Memory bandwidth domain is L3 cache. - - MB:=bw_MBps0;=bw_MBps1;... - -Reading/writing the schemata file ---------------------------------- -Reading the schemata file will show the state of all resources -on all domains. When writing you only need to specify those values -which you wish to change. E.g. - -# cat schemata -L3DATA:0=fffff;1=fffff;2=fffff;3=fffff -L3CODE:0=fffff;1=fffff;2=fffff;3=fffff -# echo "L3DATA:2=3c0;" > schemata -# cat schemata -L3DATA:0=fffff;1=fffff;2=3c0;3=fffff -L3CODE:0=fffff;1=fffff;2=fffff;3=fffff - -Cache Pseudo-Locking --------------------- -CAT enables a user to specify the amount of cache space that an -application can fill. Cache pseudo-locking builds on the fact that a -CPU can still read and write data pre-allocated outside its current -allocated area on a cache hit. With cache pseudo-locking, data can be -preloaded into a reserved portion of cache that no application can -fill, and from that point on will only serve cache hits. The cache -pseudo-locked memory is made accessible to user space where an -application can map it into its virtual address space and thus have -a region of memory with reduced average read latency. - -The creation of a cache pseudo-locked region is triggered by a request -from the user to do so that is accompanied by a schemata of the region -to be pseudo-locked. The cache pseudo-locked region is created as follows: -- Create a CAT allocation CLOSNEW with a CBM matching the schemata - from the user of the cache region that will contain the pseudo-locked - memory. This region must not overlap with any current CAT allocation/CLOS - on the system and no future overlap with this cache region is allowed - while the pseudo-locked region exists. -- Create a contiguous region of memory of the same size as the cache - region. -- Flush the cache, disable hardware prefetchers, disable preemption. -- Make CLOSNEW the active CLOS and touch the allocated memory to load - it into the cache. -- Set the previous CLOS as active. -- At this point the closid CLOSNEW can be released - the cache - pseudo-locked region is protected as long as its CBM does not appear in - any CAT allocation. Even though the cache pseudo-locked region will from - this point on not appear in any CBM of any CLOS an application running with - any CLOS will be able to access the memory in the pseudo-locked region since - the region continues to serve cache hits. -- The contiguous region of memory loaded into the cache is exposed to - user-space as a character device. - -Cache pseudo-locking increases the probability that data will remain -in the cache via carefully configuring the CAT feature and controlling -application behavior. There is no guarantee that data is placed in -cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict -“locked” data from cache. Power management C-states may shrink or -power off cache. Deeper C-states will automatically be restricted on -pseudo-locked region creation. - -It is required that an application using a pseudo-locked region runs -with affinity to the cores (or a subset of the cores) associated -with the cache on which the pseudo-locked region resides. A sanity check -within the code will not allow an application to map pseudo-locked memory -unless it runs with affinity to cores associated with the cache on which the -pseudo-locked region resides. The sanity check is only done during the -initial mmap() handling, there is no enforcement afterwards and the -application self needs to ensure it remains affine to the correct cores. - -Pseudo-locking is accomplished in two stages: -1) During the first stage the system administrator allocates a portion - of cache that should be dedicated to pseudo-locking. At this time an - equivalent portion of memory is allocated, loaded into allocated - cache portion, and exposed as a character device. -2) During the second stage a user-space application maps (mmap()) the - pseudo-locked memory into its address space. - -Cache Pseudo-Locking Interface ------------------------------- -A pseudo-locked region is created using the resctrl interface as follows: - -1) Create a new resource group by creating a new directory in /sys/fs/resctrl. -2) Change the new resource group's mode to "pseudo-locksetup" by writing - "pseudo-locksetup" to the "mode" file. -3) Write the schemata of the pseudo-locked region to the "schemata" file. All - bits within the schemata should be "unused" according to the "bit_usage" - file. - -On successful pseudo-locked region creation the "mode" file will contain -"pseudo-locked" and a new character device with the same name as the resource -group will exist in /dev/pseudo_lock. This character device can be mmap()'ed -by user space in order to obtain access to the pseudo-locked memory region. - -An example of cache pseudo-locked region creation and usage can be found below. - -Cache Pseudo-Locking Debugging Interface ---------------------------------------- -The pseudo-locking debugging interface is enabled by default (if -CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl. - -There is no explicit way for the kernel to test if a provided memory -location is present in the cache. The pseudo-locking debugging interface uses -the tracing infrastructure to provide two ways to measure cache residency of -the pseudo-locked region: -1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data - from these measurements are best visualized using a hist trigger (see - example below). In this test the pseudo-locked region is traversed at - a stride of 32 bytes while hardware prefetchers and preemption - are disabled. This also provides a substitute visualization of cache - hits and misses. -2) Cache hit and miss measurements using model specific precision counters if - available. Depending on the levels of cache on the system the pseudo_lock_l2 - and pseudo_lock_l3 tracepoints are available. - -When a pseudo-locked region is created a new debugfs directory is created for -it in debugfs as /sys/kernel/debug/resctrl/. A single -write-only file, pseudo_lock_measure, is present in this directory. The -measurement of the pseudo-locked region depends on the number written to this -debugfs file: -1 - writing "1" to the pseudo_lock_measure file will trigger the latency - measurement captured in the pseudo_lock_mem_latency tracepoint. See - example below. -2 - writing "2" to the pseudo_lock_measure file will trigger the L2 cache - residency (cache hits and misses) measurement captured in the - pseudo_lock_l2 tracepoint. See example below. -3 - writing "3" to the pseudo_lock_measure file will trigger the L3 cache - residency (cache hits and misses) measurement captured in the - pseudo_lock_l3 tracepoint. - -All measurements are recorded with the tracing infrastructure. This requires -the relevant tracepoints to be enabled before the measurement is triggered. - -Example of latency debugging interface: -In this example a pseudo-locked region named "newlock" was created. Here is -how we can measure the latency in cycles of reading from this region and -visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS -is set: -# :> /sys/kernel/debug/tracing/trace -# echo 'hist:keys=latency' > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/trigger -# echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable -# echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure -# echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable -# cat /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/hist - -# event histogram -# -# trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active] -# - -{ latency: 456 } hitcount: 1 -{ latency: 50 } hitcount: 83 -{ latency: 36 } hitcount: 96 -{ latency: 44 } hitcount: 174 -{ latency: 48 } hitcount: 195 -{ latency: 46 } hitcount: 262 -{ latency: 42 } hitcount: 693 -{ latency: 40 } hitcount: 3204 -{ latency: 38 } hitcount: 3484 - -Totals: - Hits: 8192 - Entries: 9 - Dropped: 0 - -Example of cache hits/misses debugging: -In this example a pseudo-locked region named "newlock" was created on the L2 -cache of a platform. Here is how we can obtain details of the cache hits -and misses using the platform's precision counters. - -# :> /sys/kernel/debug/tracing/trace -# echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable -# echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure -# echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable -# cat /sys/kernel/debug/tracing/trace - -# tracer: nop -# -# _-----=> irqs-off -# / _----=> need-resched -# | / _---=> hardirq/softirq -# || / _--=> preempt-depth -# ||| / delay -# TASK-PID CPU# |||| TIMESTAMP FUNCTION -# | | | |||| | | - pseudo_lock_mea-1672 [002] .... 3132.860500: pseudo_lock_l2: hits=4097 miss=0 - - -Examples for RDT allocation usage: - -Example 1 ---------- -On a two socket machine (one L3 cache per socket) with just four bits -for cache bit masks, minimum b/w of 10% with a memory bandwidth -granularity of 10% - -# mount -t resctrl resctrl /sys/fs/resctrl -# cd /sys/fs/resctrl -# mkdir p0 p1 -# echo "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata -# echo "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata - -The default resource group is unmodified, so we have access to all parts -of all caches (its schemata file reads "L3:0=f;1=f"). - -Tasks that are under the control of group "p0" may only allocate from the -"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1. -Tasks in group "p1" use the "lower" 50% of cache on both sockets. - -Similarly, tasks that are under the control of group "p0" may use a -maximum memory b/w of 50% on socket0 and 50% on socket 1. -Tasks in group "p1" may also use 50% memory b/w on both sockets. -Note that unlike cache masks, memory b/w cannot specify whether these -allocations can overlap or not. The allocations specifies the maximum -b/w that the group may be able to use and the system admin can configure -the b/w accordingly. - -If the MBA is specified in MB(megabytes) then user can enter the max b/w in MB -rather than the percentage values. - -# echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata -# echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata - -In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w -of 1024MB where as on socket 1 they would use 500MB. - -Example 2 ---------- -Again two sockets, but this time with a more realistic 20-bit mask. - -Two real time tasks pid=1234 running on processor 0 and pid=5678 running on -processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy -neighbors, each of the two real-time tasks exclusively occupies one quarter -of L3 cache on socket 0. - -# mount -t resctrl resctrl /sys/fs/resctrl -# cd /sys/fs/resctrl - -First we reset the schemata for the default group so that the "upper" -50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by -ordinary tasks: - -# echo "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata - -Next we make a resource group for our first real time task and give -it access to the "top" 25% of the cache on socket 0. - -# mkdir p0 -# echo "L3:0=f8000;1=fffff" > p0/schemata - -Finally we move our first real time task into this resource group. We -also use taskset(1) to ensure the task always runs on a dedicated CPU -on socket 0. Most uses of resource groups will also constrain which -processors tasks run on. - -# echo 1234 > p0/tasks -# taskset -cp 1 1234 - -Ditto for the second real time task (with the remaining 25% of cache): - -# mkdir p1 -# echo "L3:0=7c00;1=fffff" > p1/schemata -# echo 5678 > p1/tasks -# taskset -cp 2 5678 - -For the same 2 socket system with memory b/w resource and CAT L3 the -schemata would look like(Assume min_bandwidth 10 and bandwidth_gran is -10): - -For our first real time task this would request 20% memory b/w on socket -0. - -# echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata - -For our second real time task this would request an other 20% memory b/w -on socket 0. - -# echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata - -Example 3 ---------- - -A single socket system which has real-time tasks running on core 4-7 and -non real-time workload assigned to core 0-3. The real-time tasks share text -and data, so a per task association is not required and due to interaction -with the kernel it's desired that the kernel on these cores shares L3 with -the tasks. - -# mount -t resctrl resctrl /sys/fs/resctrl -# cd /sys/fs/resctrl - -First we reset the schemata for the default group so that the "upper" -50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0 -cannot be used by ordinary tasks: - -# echo "L3:0=3ff\nMB:0=50" > schemata - -Next we make a resource group for our real time cores and give it access -to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on -socket 0. - -# mkdir p0 -# echo "L3:0=ffc00\nMB:0=50" > p0/schemata - -Finally we move core 4-7 over to the new group and make sure that the -kernel and the tasks running there get 50% of the cache. They should -also get 50% of memory bandwidth assuming that the cores 4-7 are SMT -siblings and only the real time threads are scheduled on the cores 4-7. - -# echo F0 > p0/cpus - -Example 4 ---------- - -The resource groups in previous examples were all in the default "shareable" -mode allowing sharing of their cache allocations. If one resource group -configures a cache allocation then nothing prevents another resource group -to overlap with that allocation. - -In this example a new exclusive resource group will be created on a L2 CAT -system with two L2 cache instances that can be configured with an 8-bit -capacity bitmask. The new exclusive resource group will be configured to use -25% of each cache instance. - -# mount -t resctrl resctrl /sys/fs/resctrl/ -# cd /sys/fs/resctrl - -First, we observe that the default group is configured to allocate to all L2 -cache: - -# cat schemata -L2:0=ff;1=ff - -We could attempt to create the new resource group at this point, but it will -fail because of the overlap with the schemata of the default group: -# mkdir p0 -# echo 'L2:0=0x3;1=0x3' > p0/schemata -# cat p0/mode -shareable -# echo exclusive > p0/mode --sh: echo: write error: Invalid argument -# cat info/last_cmd_status -schemata overlaps - -To ensure that there is no overlap with another resource group the default -resource group's schemata has to change, making it possible for the new -resource group to become exclusive. -# echo 'L2:0=0xfc;1=0xfc' > schemata -# echo exclusive > p0/mode -# grep . p0/* -p0/cpus:0 -p0/mode:exclusive -p0/schemata:L2:0=03;1=03 -p0/size:L2:0=262144;1=262144 - -A new resource group will on creation not overlap with an exclusive resource -group: -# mkdir p1 -# grep . p1/* -p1/cpus:0 -p1/mode:shareable -p1/schemata:L2:0=fc;1=fc -p1/size:L2:0=786432;1=786432 - -The bit_usage will reflect how the cache is used: -# cat info/L2/bit_usage -0=SSSSSSEE;1=SSSSSSEE - -A resource group cannot be forced to overlap with an exclusive resource group: -# echo 'L2:0=0x1;1=0x1' > p1/schemata --sh: echo: write error: Invalid argument -# cat info/last_cmd_status -overlaps with exclusive group - -Example of Cache Pseudo-Locking -------------------------------- -Lock portion of L2 cache from cache id 1 using CBM 0x3. Pseudo-locked -region is exposed at /dev/pseudo_lock/newlock that can be provided to -application for argument to mmap(). - -# mount -t resctrl resctrl /sys/fs/resctrl/ -# cd /sys/fs/resctrl - -Ensure that there are bits available that can be pseudo-locked, since only -unused bits can be pseudo-locked the bits to be pseudo-locked needs to be -removed from the default resource group's schemata: -# cat info/L2/bit_usage -0=SSSSSSSS;1=SSSSSSSS -# echo 'L2:1=0xfc' > schemata -# cat info/L2/bit_usage -0=SSSSSSSS;1=SSSSSS00 - -Create a new resource group that will be associated with the pseudo-locked -region, indicate that it will be used for a pseudo-locked region, and -configure the requested pseudo-locked region capacity bitmask: - -# mkdir newlock -# echo pseudo-locksetup > newlock/mode -# echo 'L2:1=0x3' > newlock/schemata - -On success the resource group's mode will change to pseudo-locked, the -bit_usage will reflect the pseudo-locked region, and the character device -exposing the pseudo-locked region will exist: - -# cat newlock/mode -pseudo-locked -# cat info/L2/bit_usage -0=SSSSSSSS;1=SSSSSSPP -# ls -l /dev/pseudo_lock/newlock -crw------- 1 root root 243, 0 Apr 3 05:01 /dev/pseudo_lock/newlock - -/* - * Example code to access one page of pseudo-locked cache region - * from user space. - */ -#define _GNU_SOURCE -#include -#include -#include -#include -#include -#include - -/* - * It is required that the application runs with affinity to only - * cores associated with the pseudo-locked region. Here the cpu - * is hardcoded for convenience of example. - */ -static int cpuid = 2; - -int main(int argc, char *argv[]) -{ - cpu_set_t cpuset; - long page_size; - void *mapping; - int dev_fd; - int ret; - - page_size = sysconf(_SC_PAGESIZE); - - CPU_ZERO(&cpuset); - CPU_SET(cpuid, &cpuset); - ret = sched_setaffinity(0, sizeof(cpuset), &cpuset); - if (ret < 0) { - perror("sched_setaffinity"); - exit(EXIT_FAILURE); - } - - dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR); - if (dev_fd < 0) { - perror("open"); - exit(EXIT_FAILURE); - } - - mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, - dev_fd, 0); - if (mapping == MAP_FAILED) { - perror("mmap"); - close(dev_fd); - exit(EXIT_FAILURE); - } - - /* Application interacts with pseudo-locked memory @mapping */ - - ret = munmap(mapping, page_size); - if (ret < 0) { - perror("munmap"); - close(dev_fd); - exit(EXIT_FAILURE); - } - - close(dev_fd); - exit(EXIT_SUCCESS); -} - -Locking between applications ----------------------------- - -Certain operations on the resctrl filesystem, composed of read/writes -to/from multiple files, must be atomic. - -As an example, the allocation of an exclusive reservation of L3 cache -involves: - - 1. Read the cbmmasks from each directory or the per-resource "bit_usage" - 2. Find a contiguous set of bits in the global CBM bitmask that is clear - in any of the directory cbmmasks - 3. Create a new directory - 4. Set the bits found in step 2 to the new directory "schemata" file - -If two applications attempt to allocate space concurrently then they can -end up allocating the same bits so the reservations are shared instead of -exclusive. - -To coordinate atomic operations on the resctrlfs and to avoid the problem -above, the following locking procedure is recommended: - -Locking is based on flock, which is available in libc and also as a shell -script command - -Write lock: - - A) Take flock(LOCK_EX) on /sys/fs/resctrl - B) Read/write the directory structure. - C) funlock - -Read lock: - - A) Take flock(LOCK_SH) on /sys/fs/resctrl - B) If success read the directory structure. - C) funlock - -Example with bash: - -# Atomically read directory structure -$ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl - -# Read directory contents and create new subdirectory - -$ cat create-dir.sh -find /sys/fs/resctrl/ > output.txt -mask = function-of(output.txt) -mkdir /sys/fs/resctrl/newres/ -echo mask > /sys/fs/resctrl/newres/schemata - -$ flock /sys/fs/resctrl/ ./create-dir.sh - -Example with C: - -/* - * Example code do take advisory locks - * before accessing resctrl filesystem - */ -#include -#include - -void resctrl_take_shared_lock(int fd) -{ - int ret; - - /* take shared lock on resctrl filesystem */ - ret = flock(fd, LOCK_SH); - if (ret) { - perror("flock"); - exit(-1); - } -} - -void resctrl_take_exclusive_lock(int fd) -{ - int ret; - - /* release lock on resctrl filesystem */ - ret = flock(fd, LOCK_EX); - if (ret) { - perror("flock"); - exit(-1); - } -} - -void resctrl_release_lock(int fd) -{ - int ret; - - /* take shared lock on resctrl filesystem */ - ret = flock(fd, LOCK_UN); - if (ret) { - perror("flock"); - exit(-1); - } -} - -void main(void) -{ - int fd, ret; - - fd = open("/sys/fs/resctrl", O_DIRECTORY); - if (fd == -1) { - perror("open"); - exit(-1); - } - resctrl_take_shared_lock(fd); - /* code to read directory contents */ - resctrl_release_lock(fd); - - resctrl_take_exclusive_lock(fd); - /* code to read and write directory contents */ - resctrl_release_lock(fd); -} - -Examples for RDT Monitoring along with allocation usage: - -Reading monitored data ----------------------- -Reading an event file (for ex: mon_data/mon_L3_00/llc_occupancy) would -show the current snapshot of LLC occupancy of the corresponding MON -group or CTRL_MON group. - - -Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group) ---------- -On a two socket machine (one L3 cache per socket) with just four bits -for cache bit masks - -# mount -t resctrl resctrl /sys/fs/resctrl -# cd /sys/fs/resctrl -# mkdir p0 p1 -# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata -# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata -# echo 5678 > p1/tasks -# echo 5679 > p1/tasks - -The default resource group is unmodified, so we have access to all parts -of all caches (its schemata file reads "L3:0=f;1=f"). - -Tasks that are under the control of group "p0" may only allocate from the -"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1. -Tasks in group "p1" use the "lower" 50% of cache on both sockets. - -Create monitor groups and assign a subset of tasks to each monitor group. - -# cd /sys/fs/resctrl/p1/mon_groups -# mkdir m11 m12 -# echo 5678 > m11/tasks -# echo 5679 > m12/tasks - -fetch data (data shown in bytes) - -# cat m11/mon_data/mon_L3_00/llc_occupancy -16234000 -# cat m11/mon_data/mon_L3_01/llc_occupancy -14789000 -# cat m12/mon_data/mon_L3_00/llc_occupancy -16789000 - -The parent ctrl_mon group shows the aggregated data. - -# cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy -31234000 - -Example 2 (Monitor a task from its creation) ---------- -On a two socket machine (one L3 cache per socket) - -# mount -t resctrl resctrl /sys/fs/resctrl -# cd /sys/fs/resctrl -# mkdir p0 p1 - -An RMID is allocated to the group once its created and hence the -below is monitored from its creation. - -# echo $$ > /sys/fs/resctrl/p1/tasks -# - -Fetch the data - -# cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy -31789000 - -Example 3 (Monitor without CAT support or before creating CAT groups) ---------- - -Assume a system like HSW has only CQM and no CAT support. In this case -the resctrl will still mount but cannot create CTRL_MON directories. -But user can create different MON groups within the root group thereby -able to monitor all tasks including kernel threads. - -This can also be used to profile jobs cache size footprint before being -able to allocate them to different allocation groups. - -# mount -t resctrl resctrl /sys/fs/resctrl -# cd /sys/fs/resctrl -# mkdir mon_groups/m01 -# mkdir mon_groups/m02 - -# echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks -# echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks - -Monitor the groups separately and also get per domain data. From the -below its apparent that the tasks are mostly doing work on -domain(socket) 0. - -# cat /sys/fs/resctrl/mon_groups/m01/mon_L3_00/llc_occupancy -31234000 -# cat /sys/fs/resctrl/mon_groups/m01/mon_L3_01/llc_occupancy -34555 -# cat /sys/fs/resctrl/mon_groups/m02/mon_L3_00/llc_occupancy -31234000 -# cat /sys/fs/resctrl/mon_groups/m02/mon_L3_01/llc_occupancy -32789 - - -Example 4 (Monitor real time tasks) ------------------------------------ - -A single socket system which has real time tasks running on cores 4-7 -and non real time tasks on other cpus. We want to monitor the cache -occupancy of the real time threads on these cores. - -# mount -t resctrl resctrl /sys/fs/resctrl -# cd /sys/fs/resctrl -# mkdir p1 - -Move the cpus 4-7 over to p1 -# echo f0 > p1/cpus - -View the llc occupancy snapshot - -# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy -11234000 -- cgit v1.2.3 From 9d12f58fe91e5e72ef08afe9e3e66d1c755fc085 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:32 +0800 Subject: Documentation: x86: convert orc-unwinder.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/index.rst | 1 + Documentation/x86/orc-unwinder.rst | 182 +++++++++++++++++++++++++++++++++++++ Documentation/x86/orc-unwinder.txt | 179 ------------------------------------ 3 files changed, 183 insertions(+), 179 deletions(-) create mode 100644 Documentation/x86/orc-unwinder.rst delete mode 100644 Documentation/x86/orc-unwinder.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index 6e3c887a0c3b..453557097743 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -14,6 +14,7 @@ x86-specific Documentation kernel-stacks entry_64 earlyprintk + orc-unwinder zero-page tlb mtrr diff --git a/Documentation/x86/orc-unwinder.rst b/Documentation/x86/orc-unwinder.rst new file mode 100644 index 000000000000..d811576c1f3e --- /dev/null +++ b/Documentation/x86/orc-unwinder.rst @@ -0,0 +1,182 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============ +ORC unwinder +============ + +Overview +======== + +The kernel CONFIG_UNWINDER_ORC option enables the ORC unwinder, which is +similar in concept to a DWARF unwinder. The difference is that the +format of the ORC data is much simpler than DWARF, which in turn allows +the ORC unwinder to be much simpler and faster. + +The ORC data consists of unwind tables which are generated by objtool. +They contain out-of-band data which is used by the in-kernel ORC +unwinder. Objtool generates the ORC data by first doing compile-time +stack metadata validation (CONFIG_STACK_VALIDATION). After analyzing +all the code paths of a .o file, it determines information about the +stack state at each instruction address in the file and outputs that +information to the .orc_unwind and .orc_unwind_ip sections. + +The per-object ORC sections are combined at link time and are sorted and +post-processed at boot time. The unwinder uses the resulting data to +correlate instruction addresses with their stack states at run time. + + +ORC vs frame pointers +===================== + +With frame pointers enabled, GCC adds instrumentation code to every +function in the kernel. The kernel's .text size increases by about +3.2%, resulting in a broad kernel-wide slowdown. Measurements by Mel +Gorman [1]_ have shown a slowdown of 5-10% for some workloads. + +In contrast, the ORC unwinder has no effect on text size or runtime +performance, because the debuginfo is out of band. So if you disable +frame pointers and enable the ORC unwinder, you get a nice performance +improvement across the board, and still have reliable stack traces. + +Ingo Molnar says: + + "Note that it's not just a performance improvement, but also an + instruction cache locality improvement: 3.2% .text savings almost + directly transform into a similarly sized reduction in cache + footprint. That can transform to even higher speedups for workloads + whose cache locality is borderline." + +Another benefit of ORC compared to frame pointers is that it can +reliably unwind across interrupts and exceptions. Frame pointer based +unwinds can sometimes skip the caller of the interrupted function, if it +was a leaf function or if the interrupt hit before the frame pointer was +saved. + +The main disadvantage of the ORC unwinder compared to frame pointers is +that it needs more memory to store the ORC unwind tables: roughly 2-4MB +depending on the kernel config. + + +ORC vs DWARF +============ + +ORC debuginfo's advantage over DWARF itself is that it's much simpler. +It gets rid of the complex DWARF CFI state machine and also gets rid of +the tracking of unnecessary registers. This allows the unwinder to be +much simpler, meaning fewer bugs, which is especially important for +mission critical oops code. + +The simpler debuginfo format also enables the unwinder to be much faster +than DWARF, which is important for perf and lockdep. In a basic +performance test by Jiri Slaby [2]_, the ORC unwinder was about 20x +faster than an out-of-tree DWARF unwinder. (Note: That measurement was +taken before some performance tweaks were added, which doubled +performance, so the speedup over DWARF may be closer to 40x.) + +The ORC data format does have a few downsides compared to DWARF. ORC +unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel) +than DWARF-based eh_frame tables. + +Another potential downside is that, as GCC evolves, it's conceivable +that the ORC data may end up being *too* simple to describe the state of +the stack for certain optimizations. But IMO this is unlikely because +GCC saves the frame pointer for any unusual stack adjustments it does, +so I suspect we'll really only ever need to keep track of the stack +pointer and the frame pointer between call frames. But even if we do +end up having to track all the registers DWARF tracks, at least we will +still be able to control the format, e.g. no complex state machines. + + +ORC unwind table generation +=========================== + +The ORC data is generated by objtool. With the existing compile-time +stack metadata validation feature, objtool already follows all code +paths, and so it already has all the information it needs to be able to +generate ORC data from scratch. So it's an easy step to go from stack +validation to ORC data generation. + +It should be possible to instead generate the ORC data with a simple +tool which converts DWARF to ORC data. However, such a solution would +be incomplete due to the kernel's extensive use of asm, inline asm, and +special sections like exception tables. + +That could be rectified by manually annotating those special code paths +using GNU assembler .cfi annotations in .S files, and homegrown +annotations for inline asm in .c files. But asm annotations were tried +in the past and were found to be unmaintainable. They were often +incorrect/incomplete and made the code harder to read and keep updated. +And based on looking at glibc code, annotating inline asm in .c files +might be even worse. + +Objtool still needs a few annotations, but only in code which does +unusual things to the stack like entry code. And even then, far fewer +annotations are needed than what DWARF would need, so they're much more +maintainable than DWARF CFI annotations. + +So the advantages of using objtool to generate ORC data are that it +gives more accurate debuginfo, with very few annotations. It also +insulates the kernel from toolchain bugs which can be very painful to +deal with in the kernel since we often have to workaround issues in +older versions of the toolchain for years. + +The downside is that the unwinder now becomes dependent on objtool's +ability to reverse engineer GCC code flow. If GCC optimizations become +too complicated for objtool to follow, the ORC data generation might +stop working or become incomplete. (It's worth noting that livepatch +already has such a dependency on objtool's ability to follow GCC code +flow.) + +If newer versions of GCC come up with some optimizations which break +objtool, we may need to revisit the current implementation. Some +possible solutions would be asking GCC to make the optimizations more +palatable, or having objtool use DWARF as an additional input, or +creating a GCC plugin to assist objtool with its analysis. But for now, +objtool follows GCC code quite well. + + +Unwinder implementation details +=============================== + +Objtool generates the ORC data by integrating with the compile-time +stack metadata validation feature, which is described in detail in +tools/objtool/Documentation/stack-validation.txt. After analyzing all +the code paths of a .o file, it creates an array of orc_entry structs, +and a parallel array of instruction addresses associated with those +structs, and writes them to the .orc_unwind and .orc_unwind_ip sections +respectively. + +The ORC data is split into the two arrays for performance reasons, to +make the searchable part of the data (.orc_unwind_ip) more compact. The +arrays are sorted in parallel at boot time. + +Performance is further improved by the use of a fast lookup table which +is created at runtime. The fast lookup table associates a given address +with a range of indices for the .orc_unwind table, so that only a small +subset of the table needs to be searched. + + +Etymology +========= + +Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural +enemies. Similarly, the ORC unwinder was created in opposition to the +complexity and slowness of DWARF. + +"Although Orcs rarely consider multiple solutions to a problem, they do +excel at getting things done because they are creatures of action, not +thought." [3]_ Similarly, unlike the esoteric DWARF unwinder, the +veracious ORC unwinder wastes no time or siloconic effort decoding +variable-length zero-extended unsigned-integer byte-coded +state-machine-based debug information entries. + +Similar to how Orcs frequently unravel the well-intentioned plans of +their adversaries, the ORC unwinder frequently unravels stacks with +brutal, unyielding efficiency. + +ORC stands for Oops Rewind Capability. + + +.. [1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de +.. [2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz +.. [3] http://dustin.wikidot.com/half-orcs-and-orcs diff --git a/Documentation/x86/orc-unwinder.txt b/Documentation/x86/orc-unwinder.txt deleted file mode 100644 index cd4b29be29af..000000000000 --- a/Documentation/x86/orc-unwinder.txt +++ /dev/null @@ -1,179 +0,0 @@ -ORC unwinder -============ - -Overview --------- - -The kernel CONFIG_UNWINDER_ORC option enables the ORC unwinder, which is -similar in concept to a DWARF unwinder. The difference is that the -format of the ORC data is much simpler than DWARF, which in turn allows -the ORC unwinder to be much simpler and faster. - -The ORC data consists of unwind tables which are generated by objtool. -They contain out-of-band data which is used by the in-kernel ORC -unwinder. Objtool generates the ORC data by first doing compile-time -stack metadata validation (CONFIG_STACK_VALIDATION). After analyzing -all the code paths of a .o file, it determines information about the -stack state at each instruction address in the file and outputs that -information to the .orc_unwind and .orc_unwind_ip sections. - -The per-object ORC sections are combined at link time and are sorted and -post-processed at boot time. The unwinder uses the resulting data to -correlate instruction addresses with their stack states at run time. - - -ORC vs frame pointers ---------------------- - -With frame pointers enabled, GCC adds instrumentation code to every -function in the kernel. The kernel's .text size increases by about -3.2%, resulting in a broad kernel-wide slowdown. Measurements by Mel -Gorman [1] have shown a slowdown of 5-10% for some workloads. - -In contrast, the ORC unwinder has no effect on text size or runtime -performance, because the debuginfo is out of band. So if you disable -frame pointers and enable the ORC unwinder, you get a nice performance -improvement across the board, and still have reliable stack traces. - -Ingo Molnar says: - - "Note that it's not just a performance improvement, but also an - instruction cache locality improvement: 3.2% .text savings almost - directly transform into a similarly sized reduction in cache - footprint. That can transform to even higher speedups for workloads - whose cache locality is borderline." - -Another benefit of ORC compared to frame pointers is that it can -reliably unwind across interrupts and exceptions. Frame pointer based -unwinds can sometimes skip the caller of the interrupted function, if it -was a leaf function or if the interrupt hit before the frame pointer was -saved. - -The main disadvantage of the ORC unwinder compared to frame pointers is -that it needs more memory to store the ORC unwind tables: roughly 2-4MB -depending on the kernel config. - - -ORC vs DWARF ------------- - -ORC debuginfo's advantage over DWARF itself is that it's much simpler. -It gets rid of the complex DWARF CFI state machine and also gets rid of -the tracking of unnecessary registers. This allows the unwinder to be -much simpler, meaning fewer bugs, which is especially important for -mission critical oops code. - -The simpler debuginfo format also enables the unwinder to be much faster -than DWARF, which is important for perf and lockdep. In a basic -performance test by Jiri Slaby [2], the ORC unwinder was about 20x -faster than an out-of-tree DWARF unwinder. (Note: That measurement was -taken before some performance tweaks were added, which doubled -performance, so the speedup over DWARF may be closer to 40x.) - -The ORC data format does have a few downsides compared to DWARF. ORC -unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel) -than DWARF-based eh_frame tables. - -Another potential downside is that, as GCC evolves, it's conceivable -that the ORC data may end up being *too* simple to describe the state of -the stack for certain optimizations. But IMO this is unlikely because -GCC saves the frame pointer for any unusual stack adjustments it does, -so I suspect we'll really only ever need to keep track of the stack -pointer and the frame pointer between call frames. But even if we do -end up having to track all the registers DWARF tracks, at least we will -still be able to control the format, e.g. no complex state machines. - - -ORC unwind table generation ---------------------------- - -The ORC data is generated by objtool. With the existing compile-time -stack metadata validation feature, objtool already follows all code -paths, and so it already has all the information it needs to be able to -generate ORC data from scratch. So it's an easy step to go from stack -validation to ORC data generation. - -It should be possible to instead generate the ORC data with a simple -tool which converts DWARF to ORC data. However, such a solution would -be incomplete due to the kernel's extensive use of asm, inline asm, and -special sections like exception tables. - -That could be rectified by manually annotating those special code paths -using GNU assembler .cfi annotations in .S files, and homegrown -annotations for inline asm in .c files. But asm annotations were tried -in the past and were found to be unmaintainable. They were often -incorrect/incomplete and made the code harder to read and keep updated. -And based on looking at glibc code, annotating inline asm in .c files -might be even worse. - -Objtool still needs a few annotations, but only in code which does -unusual things to the stack like entry code. And even then, far fewer -annotations are needed than what DWARF would need, so they're much more -maintainable than DWARF CFI annotations. - -So the advantages of using objtool to generate ORC data are that it -gives more accurate debuginfo, with very few annotations. It also -insulates the kernel from toolchain bugs which can be very painful to -deal with in the kernel since we often have to workaround issues in -older versions of the toolchain for years. - -The downside is that the unwinder now becomes dependent on objtool's -ability to reverse engineer GCC code flow. If GCC optimizations become -too complicated for objtool to follow, the ORC data generation might -stop working or become incomplete. (It's worth noting that livepatch -already has such a dependency on objtool's ability to follow GCC code -flow.) - -If newer versions of GCC come up with some optimizations which break -objtool, we may need to revisit the current implementation. Some -possible solutions would be asking GCC to make the optimizations more -palatable, or having objtool use DWARF as an additional input, or -creating a GCC plugin to assist objtool with its analysis. But for now, -objtool follows GCC code quite well. - - -Unwinder implementation details -------------------------------- - -Objtool generates the ORC data by integrating with the compile-time -stack metadata validation feature, which is described in detail in -tools/objtool/Documentation/stack-validation.txt. After analyzing all -the code paths of a .o file, it creates an array of orc_entry structs, -and a parallel array of instruction addresses associated with those -structs, and writes them to the .orc_unwind and .orc_unwind_ip sections -respectively. - -The ORC data is split into the two arrays for performance reasons, to -make the searchable part of the data (.orc_unwind_ip) more compact. The -arrays are sorted in parallel at boot time. - -Performance is further improved by the use of a fast lookup table which -is created at runtime. The fast lookup table associates a given address -with a range of indices for the .orc_unwind table, so that only a small -subset of the table needs to be searched. - - -Etymology ---------- - -Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural -enemies. Similarly, the ORC unwinder was created in opposition to the -complexity and slowness of DWARF. - -"Although Orcs rarely consider multiple solutions to a problem, they do -excel at getting things done because they are creatures of action, not -thought." [3] Similarly, unlike the esoteric DWARF unwinder, the -veracious ORC unwinder wastes no time or siloconic effort decoding -variable-length zero-extended unsigned-integer byte-coded -state-machine-based debug information entries. - -Similar to how Orcs frequently unravel the well-intentioned plans of -their adversaries, the ORC unwinder frequently unravels stacks with -brutal, unyielding efficiency. - -ORC stands for Oops Rewind Capability. - - -[1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de -[2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz -[3] http://dustin.wikidot.com/half-orcs-and-orcs -- cgit v1.2.3 From 71892b25fc49999071472f6bce589c18468a85a8 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:33 +0800 Subject: Documentation: x86: convert usb-legacy-support.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/index.rst | 1 + Documentation/x86/usb-legacy-support.rst | 50 ++++++++++++++++++++++++++++++++ Documentation/x86/usb-legacy-support.txt | 44 ---------------------------- 3 files changed, 51 insertions(+), 44 deletions(-) create mode 100644 Documentation/x86/usb-legacy-support.rst delete mode 100644 Documentation/x86/usb-legacy-support.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index 453557097743..3eb0334ae2d4 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -25,3 +25,4 @@ x86-specific Documentation pti microcode resctrl_ui + usb-legacy-support diff --git a/Documentation/x86/usb-legacy-support.rst b/Documentation/x86/usb-legacy-support.rst new file mode 100644 index 000000000000..e01c08b7c981 --- /dev/null +++ b/Documentation/x86/usb-legacy-support.rst @@ -0,0 +1,50 @@ + +.. SPDX-License-Identifier: GPL-2.0 + +================== +USB Legacy support +================== + +:Author: Vojtech Pavlik , January 2004 + + +Also known as "USB Keyboard" or "USB Mouse support" in the BIOS Setup is a +feature that allows one to use the USB mouse and keyboard as if they were +their classic PS/2 counterparts. This means one can use an USB keyboard to +type in LILO for example. + +It has several drawbacks, though: + +1) On some machines, the emulated PS/2 mouse takes over even when no USB + mouse is present and a real PS/2 mouse is present. In that case the extra + features (wheel, extra buttons, touchpad mode) of the real PS/2 mouse may + not be available. + +2) If CONFIG_HIGHMEM64G is enabled, the PS/2 mouse emulation can cause + system crashes, because the SMM BIOS is not expecting to be in PAE mode. + The Intel E7505 is a typical machine where this happens. + +3) If AMD64 64-bit mode is enabled, again system crashes often happen, + because the SMM BIOS isn't expecting the CPU to be in 64-bit mode. The + BIOS manufacturers only test with Windows, and Windows doesn't do 64-bit + yet. + +Solutions: + +Problem 1) + can be solved by loading the USB drivers prior to loading the + PS/2 mouse driver. Since the PS/2 mouse driver is in 2.6 compiled into + the kernel unconditionally, this means the USB drivers need to be + compiled-in, too. + +Problem 2) + can currently only be solved by either disabling HIGHMEM64G + in the kernel config or USB Legacy support in the BIOS. A BIOS update + could help, but so far no such update exists. + +Problem 3) + is usually fixed by a BIOS update. Check the board + manufacturers web site. If an update is not available, disable USB + Legacy support in the BIOS. If this alone doesn't help, try also adding + idle=poll on the kernel command line. The BIOS may be entering the SMM + on the HLT instruction as well. diff --git a/Documentation/x86/usb-legacy-support.txt b/Documentation/x86/usb-legacy-support.txt deleted file mode 100644 index 1894cdfc69d9..000000000000 --- a/Documentation/x86/usb-legacy-support.txt +++ /dev/null @@ -1,44 +0,0 @@ -USB Legacy support -~~~~~~~~~~~~~~~~~~ - -Vojtech Pavlik , January 2004 - - -Also known as "USB Keyboard" or "USB Mouse support" in the BIOS Setup is a -feature that allows one to use the USB mouse and keyboard as if they were -their classic PS/2 counterparts. This means one can use an USB keyboard to -type in LILO for example. - -It has several drawbacks, though: - -1) On some machines, the emulated PS/2 mouse takes over even when no USB - mouse is present and a real PS/2 mouse is present. In that case the extra - features (wheel, extra buttons, touchpad mode) of the real PS/2 mouse may - not be available. - -2) If CONFIG_HIGHMEM64G is enabled, the PS/2 mouse emulation can cause - system crashes, because the SMM BIOS is not expecting to be in PAE mode. - The Intel E7505 is a typical machine where this happens. - -3) If AMD64 64-bit mode is enabled, again system crashes often happen, - because the SMM BIOS isn't expecting the CPU to be in 64-bit mode. The - BIOS manufacturers only test with Windows, and Windows doesn't do 64-bit - yet. - -Solutions: - -Problem 1) can be solved by loading the USB drivers prior to loading the -PS/2 mouse driver. Since the PS/2 mouse driver is in 2.6 compiled into -the kernel unconditionally, this means the USB drivers need to be -compiled-in, too. - -Problem 2) can currently only be solved by either disabling HIGHMEM64G -in the kernel config or USB Legacy support in the BIOS. A BIOS update -could help, but so far no such update exists. - -Problem 3) is usually fixed by a BIOS update. Check the board -manufacturers web site. If an update is not available, disable USB -Legacy support in the BIOS. If this alone doesn't help, try also adding -idle=poll on the kernel command line. The BIOS may be entering the SMM -on the HLT instruction as well. - -- cgit v1.2.3 From 8fffdc9353d64f7ceef5c4589249759561aa1b39 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:34 +0800 Subject: Documentation: x86: convert i386/IO-APIC.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/i386/IO-APIC.rst | 123 +++++++++++++++++++++++++++++++++++++ Documentation/x86/i386/IO-APIC.txt | 119 ----------------------------------- Documentation/x86/i386/index.rst | 10 +++ Documentation/x86/index.rst | 1 + 4 files changed, 134 insertions(+), 119 deletions(-) create mode 100644 Documentation/x86/i386/IO-APIC.rst delete mode 100644 Documentation/x86/i386/IO-APIC.txt create mode 100644 Documentation/x86/i386/index.rst (limited to 'Documentation/x86') diff --git a/Documentation/x86/i386/IO-APIC.rst b/Documentation/x86/i386/IO-APIC.rst new file mode 100644 index 000000000000..ce4d8df15e7c --- /dev/null +++ b/Documentation/x86/i386/IO-APIC.rst @@ -0,0 +1,123 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======= +IO-APIC +======= + +:Author: Ingo Molnar + +Most (all) Intel-MP compliant SMP boards have the so-called 'IO-APIC', +which is an enhanced interrupt controller. It enables us to route +hardware interrupts to multiple CPUs, or to CPU groups. Without an +IO-APIC, interrupts from hardware will be delivered only to the +CPU which boots the operating system (usually CPU#0). + +Linux supports all variants of compliant SMP boards, including ones with +multiple IO-APICs. Multiple IO-APICs are used in high-end servers to +distribute IRQ load further. + +There are (a few) known breakages in certain older boards, such bugs are +usually worked around by the kernel. If your MP-compliant SMP board does +not boot Linux, then consult the linux-smp mailing list archives first. + +If your box boots fine with enabled IO-APIC IRQs, then your +/proc/interrupts will look like this one:: + + hell:~> cat /proc/interrupts + CPU0 + 0: 1360293 IO-APIC-edge timer + 1: 4 IO-APIC-edge keyboard + 2: 0 XT-PIC cascade + 13: 1 XT-PIC fpu + 14: 1448 IO-APIC-edge ide0 + 16: 28232 IO-APIC-level Intel EtherExpress Pro 10/100 Ethernet + 17: 51304 IO-APIC-level eth0 + NMI: 0 + ERR: 0 + hell:~> + +Some interrupts are still listed as 'XT PIC', but this is not a problem; +none of those IRQ sources is performance-critical. + + +In the unlikely case that your board does not create a working mp-table, +you can use the pirq= boot parameter to 'hand-construct' IRQ entries. This +is non-trivial though and cannot be automated. One sample /etc/lilo.conf +entry:: + + append="pirq=15,11,10" + +The actual numbers depend on your system, on your PCI cards and on their +PCI slot position. Usually PCI slots are 'daisy chained' before they are +connected to the PCI chipset IRQ routing facility (the incoming PIRQ1-4 +lines):: + + ,-. ,-. ,-. ,-. ,-. + PIRQ4 ----| |-. ,-| |-. ,-| |-. ,-| |--------| | + |S| \ / |S| \ / |S| \ / |S| |S| + PIRQ3 ----|l|-. `/---|l|-. `/---|l|-. `/---|l|--------|l| + |o| \/ |o| \/ |o| \/ |o| |o| + PIRQ2 ----|t|-./`----|t|-./`----|t|-./`----|t|--------|t| + |1| /\ |2| /\ |3| /\ |4| |5| + PIRQ1 ----| |- `----| |- `----| |- `----| |--------| | + `-' `-' `-' `-' `-' + +Every PCI card emits a PCI IRQ, which can be INTA, INTB, INTC or INTD:: + + ,-. + INTD--| | + |S| + INTC--|l| + |o| + INTB--|t| + |x| + INTA--| | + `-' + +These INTA-D PCI IRQs are always 'local to the card', their real meaning +depends on which slot they are in. If you look at the daisy chaining diagram, +a card in slot4, issuing INTA IRQ, it will end up as a signal on PIRQ4 of +the PCI chipset. Most cards issue INTA, this creates optimal distribution +between the PIRQ lines. (distributing IRQ sources properly is not a +necessity, PCI IRQs can be shared at will, but it's a good for performance +to have non shared interrupts). Slot5 should be used for videocards, they +do not use interrupts normally, thus they are not daisy chained either. + +so if you have your SCSI card (IRQ11) in Slot1, Tulip card (IRQ9) in +Slot2, then you'll have to specify this pirq= line:: + + append="pirq=11,9" + +the following script tries to figure out such a default pirq= line from +your PCI configuration:: + + echo -n pirq=; echo `scanpci | grep T_L | cut -c56-` | sed 's/ /,/g' + +note that this script won't work if you have skipped a few slots or if your +board does not do default daisy-chaining. (or the IO-APIC has the PIRQ pins +connected in some strange way). E.g. if in the above case you have your SCSI +card (IRQ11) in Slot3, and have Slot1 empty:: + + append="pirq=0,9,11" + +[value '0' is a generic 'placeholder', reserved for empty (or non-IRQ emitting) +slots.] + +Generally, it's always possible to find out the correct pirq= settings, just +permute all IRQ numbers properly ... it will take some time though. An +'incorrect' pirq line will cause the booting process to hang, or a device +won't function properly (e.g. if it's inserted as a module). + +If you have 2 PCI buses, then you can use up to 8 pirq values, although such +boards tend to have a good configuration. + +Be prepared that it might happen that you need some strange pirq line:: + + append="pirq=0,0,0,0,0,0,9,11" + +Use smart trial-and-error techniques to find out the correct pirq line ... + +Good luck and mail to linux-smp@vger.kernel.org or +linux-kernel@vger.kernel.org if you have any problems that are not covered +by this document. + diff --git a/Documentation/x86/i386/IO-APIC.txt b/Documentation/x86/i386/IO-APIC.txt deleted file mode 100644 index 15f5baf7e1b6..000000000000 --- a/Documentation/x86/i386/IO-APIC.txt +++ /dev/null @@ -1,119 +0,0 @@ -Most (all) Intel-MP compliant SMP boards have the so-called 'IO-APIC', -which is an enhanced interrupt controller. It enables us to route -hardware interrupts to multiple CPUs, or to CPU groups. Without an -IO-APIC, interrupts from hardware will be delivered only to the -CPU which boots the operating system (usually CPU#0). - -Linux supports all variants of compliant SMP boards, including ones with -multiple IO-APICs. Multiple IO-APICs are used in high-end servers to -distribute IRQ load further. - -There are (a few) known breakages in certain older boards, such bugs are -usually worked around by the kernel. If your MP-compliant SMP board does -not boot Linux, then consult the linux-smp mailing list archives first. - -If your box boots fine with enabled IO-APIC IRQs, then your -/proc/interrupts will look like this one: - - ----------------------------> - hell:~> cat /proc/interrupts - CPU0 - 0: 1360293 IO-APIC-edge timer - 1: 4 IO-APIC-edge keyboard - 2: 0 XT-PIC cascade - 13: 1 XT-PIC fpu - 14: 1448 IO-APIC-edge ide0 - 16: 28232 IO-APIC-level Intel EtherExpress Pro 10/100 Ethernet - 17: 51304 IO-APIC-level eth0 - NMI: 0 - ERR: 0 - hell:~> - <---------------------------- - -Some interrupts are still listed as 'XT PIC', but this is not a problem; -none of those IRQ sources is performance-critical. - - -In the unlikely case that your board does not create a working mp-table, -you can use the pirq= boot parameter to 'hand-construct' IRQ entries. This -is non-trivial though and cannot be automated. One sample /etc/lilo.conf -entry: - - append="pirq=15,11,10" - -The actual numbers depend on your system, on your PCI cards and on their -PCI slot position. Usually PCI slots are 'daisy chained' before they are -connected to the PCI chipset IRQ routing facility (the incoming PIRQ1-4 -lines): - - ,-. ,-. ,-. ,-. ,-. - PIRQ4 ----| |-. ,-| |-. ,-| |-. ,-| |--------| | - |S| \ / |S| \ / |S| \ / |S| |S| - PIRQ3 ----|l|-. `/---|l|-. `/---|l|-. `/---|l|--------|l| - |o| \/ |o| \/ |o| \/ |o| |o| - PIRQ2 ----|t|-./`----|t|-./`----|t|-./`----|t|--------|t| - |1| /\ |2| /\ |3| /\ |4| |5| - PIRQ1 ----| |- `----| |- `----| |- `----| |--------| | - `-' `-' `-' `-' `-' - -Every PCI card emits a PCI IRQ, which can be INTA, INTB, INTC or INTD: - - ,-. - INTD--| | - |S| - INTC--|l| - |o| - INTB--|t| - |x| - INTA--| | - `-' - -These INTA-D PCI IRQs are always 'local to the card', their real meaning -depends on which slot they are in. If you look at the daisy chaining diagram, -a card in slot4, issuing INTA IRQ, it will end up as a signal on PIRQ4 of -the PCI chipset. Most cards issue INTA, this creates optimal distribution -between the PIRQ lines. (distributing IRQ sources properly is not a -necessity, PCI IRQs can be shared at will, but it's a good for performance -to have non shared interrupts). Slot5 should be used for videocards, they -do not use interrupts normally, thus they are not daisy chained either. - -so if you have your SCSI card (IRQ11) in Slot1, Tulip card (IRQ9) in -Slot2, then you'll have to specify this pirq= line: - - append="pirq=11,9" - -the following script tries to figure out such a default pirq= line from -your PCI configuration: - - echo -n pirq=; echo `scanpci | grep T_L | cut -c56-` | sed 's/ /,/g' - -note that this script won't work if you have skipped a few slots or if your -board does not do default daisy-chaining. (or the IO-APIC has the PIRQ pins -connected in some strange way). E.g. if in the above case you have your SCSI -card (IRQ11) in Slot3, and have Slot1 empty: - - append="pirq=0,9,11" - -[value '0' is a generic 'placeholder', reserved for empty (or non-IRQ emitting) -slots.] - -Generally, it's always possible to find out the correct pirq= settings, just -permute all IRQ numbers properly ... it will take some time though. An -'incorrect' pirq line will cause the booting process to hang, or a device -won't function properly (e.g. if it's inserted as a module). - -If you have 2 PCI buses, then you can use up to 8 pirq values, although such -boards tend to have a good configuration. - -Be prepared that it might happen that you need some strange pirq line: - - append="pirq=0,0,0,0,0,0,9,11" - -Use smart trial-and-error techniques to find out the correct pirq line ... - -Good luck and mail to linux-smp@vger.kernel.org or -linux-kernel@vger.kernel.org if you have any problems that are not covered -by this document. - --- mingo - diff --git a/Documentation/x86/i386/index.rst b/Documentation/x86/i386/index.rst new file mode 100644 index 000000000000..8747cf5bbd49 --- /dev/null +++ b/Documentation/x86/i386/index.rst @@ -0,0 +1,10 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============ +i386 Support +============ + +.. toctree:: + :maxdepth: 2 + + IO-APIC diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index 3eb0334ae2d4..4e15bcc6456c 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -26,3 +26,4 @@ x86-specific Documentation microcode resctrl_ui usb-legacy-support + i386/index -- cgit v1.2.3 From bbea90bbb6c849b62b80ec4a83d10e64dd901bd9 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:35 +0800 Subject: Documentation: x86: convert x86_64/boot-options.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/index.rst | 1 + Documentation/x86/x86_64/boot-options.rst | 335 ++++++++++++++++++++++++++++++ Documentation/x86/x86_64/boot-options.txt | 278 ------------------------- Documentation/x86/x86_64/index.rst | 10 + 4 files changed, 346 insertions(+), 278 deletions(-) create mode 100644 Documentation/x86/x86_64/boot-options.rst delete mode 100644 Documentation/x86/x86_64/boot-options.txt create mode 100644 Documentation/x86/x86_64/index.rst (limited to 'Documentation/x86') diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst index 4e15bcc6456c..73a487957fd4 100644 --- a/Documentation/x86/index.rst +++ b/Documentation/x86/index.rst @@ -27,3 +27,4 @@ x86-specific Documentation resctrl_ui usb-legacy-support i386/index + x86_64/index diff --git a/Documentation/x86/x86_64/boot-options.rst b/Documentation/x86/x86_64/boot-options.rst new file mode 100644 index 000000000000..2f69836b8445 --- /dev/null +++ b/Documentation/x86/x86_64/boot-options.rst @@ -0,0 +1,335 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +AMD64 Specific Boot Options +=========================== + +There are many others (usually documented in driver documentation), but +only the AMD64 specific ones are listed here. + +Machine check +============= +Please see Documentation/x86/x86_64/machinecheck for sysfs runtime tunables. + + mce=off + Disable machine check + mce=no_cmci + Disable CMCI(Corrected Machine Check Interrupt) that + Intel processor supports. Usually this disablement is + not recommended, but it might be handy if your hardware + is misbehaving. + Note that you'll get more problems without CMCI than with + due to the shared banks, i.e. you might get duplicated + error logs. + mce=dont_log_ce + Don't make logs for corrected errors. All events reported + as corrected are silently cleared by OS. + This option will be useful if you have no interest in any + of corrected errors. + mce=ignore_ce + Disable features for corrected errors, e.g. polling timer + and CMCI. All events reported as corrected are not cleared + by OS and remained in its error banks. + Usually this disablement is not recommended, however if + there is an agent checking/clearing corrected errors + (e.g. BIOS or hardware monitoring applications), conflicting + with OS's error handling, and you cannot deactivate the agent, + then this option will be a help. + mce=no_lmce + Do not opt-in to Local MCE delivery. Use legacy method + to broadcast MCEs. + mce=bootlog + Enable logging of machine checks left over from booting. + Disabled by default on AMD Fam10h and older because some BIOS + leave bogus ones. + If your BIOS doesn't do that it's a good idea to enable though + to make sure you log even machine check events that result + in a reboot. On Intel systems it is enabled by default. + mce=nobootlog + Disable boot machine check logging. + mce=tolerancelevel[,monarchtimeout] (number,number) + tolerance levels: + 0: always panic on uncorrected errors, log corrected errors + 1: panic or SIGBUS on uncorrected errors, log corrected errors + 2: SIGBUS or log uncorrected errors, log corrected errors + 3: never panic or SIGBUS, log all errors (for testing only) + Default is 1 + Can be also set using sysfs which is preferable. + monarchtimeout: + Sets the time in us to wait for other CPUs on machine checks. 0 + to disable. + mce=bios_cmci_threshold + Don't overwrite the bios-set CMCI threshold. This boot option + prevents Linux from overwriting the CMCI threshold set by the + bios. Without this option, Linux always sets the CMCI + threshold to 1. Enabling this may make memory predictive failure + analysis less effective if the bios sets thresholds for memory + errors since we will not see details for all errors. + mce=recovery + Force-enable recoverable machine check code paths + + nomce (for compatibility with i386) + same as mce=off + + Everything else is in sysfs now. + +APICs +===== + + apic + Use IO-APIC. Default + + noapic + Don't use the IO-APIC. + + disableapic + Don't use the local APIC + + nolapic + Don't use the local APIC (alias for i386 compatibility) + + pirq=... + See Documentation/x86/i386/IO-APIC.txt + + noapictimer + Don't set up the APIC timer + + no_timer_check + Don't check the IO-APIC timer. This can work around + problems with incorrect timer initialization on some boards. + + apicpmtimer + Do APIC timer calibration using the pmtimer. Implies + apicmaintimer. Useful when your PIT timer is totally broken. + +Timing +====== + + notsc + Deprecated, use tsc=unstable instead. + + nohpet + Don't use the HPET timer. + +Idle loop +========= + + idle=poll + Don't do power saving in the idle loop using HLT, but poll for rescheduling + event. This will make the CPUs eat a lot more power, but may be useful + to get slightly better performance in multiprocessor benchmarks. It also + makes some profiling using performance counters more accurate. + Please note that on systems with MONITOR/MWAIT support (like Intel EM64T + CPUs) this option has no performance advantage over the normal idle loop. + It may also interact badly with hyperthreading. + +Rebooting +========= + + reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] [, [w]arm | [c]old] + bios + Use the CPU reboot vector for warm reset + warm + Don't set the cold reboot flag + cold + Set the cold reboot flag + triple + Force a triple fault (init) + kbd + Use the keyboard controller. cold reset (default) + acpi + Use the ACPI RESET_REG in the FADT. If ACPI is not configured or + the ACPI reset does not work, the reboot path attempts the reset + using the keyboard controller. + efi + Use efi reset_system runtime service. If EFI is not configured or + the EFI reset does not work, the reboot path attempts the reset using + the keyboard controller. + + Using warm reset will be much faster especially on big memory + systems because the BIOS will not go through the memory check. + Disadvantage is that not all hardware will be completely reinitialized + on reboot so there may be boot problems on some systems. + + reboot=force + Don't stop other CPUs on reboot. This can make reboot more reliable + in some cases. + +Non Executable Mappings +======================= + + noexec=on|off + on + Enable(default) + off + Disable + +NUMA +==== + + numa=off + Only set up a single NUMA node spanning all memory. + + numa=noacpi + Don't parse the SRAT table for NUMA setup + + numa=fake=[MG] + If given as a memory unit, fills all system RAM with nodes of + size interleaved over physical nodes. + + numa=fake= + If given as an integer, fills all system RAM with N fake nodes + interleaved over physical nodes. + + numa=fake=U + If given as an integer followed by 'U', it will divide each + physical node into N emulated nodes. + +ACPI +==== + + acpi=off + Don't enable ACPI + acpi=ht + Use ACPI boot table parsing, but don't enable ACPI interpreter + acpi=force + Force ACPI on (currently not needed) + acpi=strict + Disable out of spec ACPI workarounds. + acpi_sci={edge,level,high,low} + Set up ACPI SCI interrupt. + acpi=noirq + Don't route interrupts + acpi=nocmcff + Disable firmware first mode for corrected errors. This + disables parsing the HEST CMC error source to check if + firmware has set the FF flag. This may result in + duplicate corrected error reports. + +PCI +=== + + pci=off + Don't use PCI + pci=conf1 + Use conf1 access. + pci=conf2 + Use conf2 access. + pci=rom + Assign ROMs. + pci=assign-busses + Assign busses + pci=irqmask=MASK + Set PCI interrupt mask to MASK + pci=lastbus=NUMBER + Scan up to NUMBER busses, no matter what the mptable says. + pci=noacpi + Don't use ACPI to set up PCI interrupt routing. + +IOMMU (input/output memory management unit) +=========================================== +Multiple x86-64 PCI-DMA mapping implementations exist, for example: + + 1. : use no hardware/software IOMMU at all + (e.g. because you have < 3 GB memory). + Kernel boot message: "PCI-DMA: Disabling IOMMU" + + 2. : AMD GART based hardware IOMMU. + Kernel boot message: "PCI-DMA: using GART IOMMU" + + 3. : Software IOMMU implementation. Used + e.g. if there is no hardware IOMMU in the system and it is need because + you have >3GB memory or told the kernel to us it (iommu=soft)) + Kernel boot message: "PCI-DMA: Using software bounce buffering + for IO (SWIOTLB)" + + 4. : IBM Calgary hardware IOMMU. Used in IBM + pSeries and xSeries servers. This hardware IOMMU supports DMA address + mapping with memory protection, etc. + Kernel boot message: "PCI-DMA: Using Calgary IOMMU" + +:: + + iommu=[][,noagp][,off][,force][,noforce] + [,memaper[=]][,merge][,fullflush][,nomerge] + [,noaperture][,calgary] + +General iommu options: + + off + Don't initialize and use any kind of IOMMU. + noforce + Don't force hardware IOMMU usage when it is not needed. (default). + force + Force the use of the hardware IOMMU even when it is + not actually needed (e.g. because < 3 GB memory). + soft + Use software bounce buffering (SWIOTLB) (default for + Intel machines). This can be used to prevent the usage + of an available hardware IOMMU. + +iommu options only relevant to the AMD GART hardware IOMMU: + + + Set the size of the remapping area in bytes. + allowed + Overwrite iommu off workarounds for specific chipsets. + fullflush + Flush IOMMU on each allocation (default). + nofullflush + Don't use IOMMU fullflush. + memaper[=] + Allocate an own aperture over RAM with size 32MB<[,force] + + Prereserve that many 128K pages for the software IO bounce buffering. + force + Force all IO through the software TLB. + +Settings for the IBM Calgary hardware IOMMU currently found in IBM +pSeries and xSeries machines + + calgary=[64k,128k,256k,512k,1M,2M,4M,8M] + Set the size of each PCI slot's translation table when using the + Calgary IOMMU. This is the size of the translation table itself + in main memory. The smallest table, 64k, covers an IO space of + 32MB; the largest, 8MB table, can cover an IO space of 4GB. + Normally the kernel will make the right choice by itself. + calgary=[translate_empty_slots] + Enable translation even on slots that have no devices attached to + them, in case a device will be hotplugged in the future. + calgary=[disable=] + Disable translation on a given PHB. For + example, the built-in graphics adapter resides on the first bridge + (PCI bus number 0); if translation (isolation) is enabled on this + bridge, X servers that access the hardware directly from user + space might stop working. Use this option if you have devices that + are accessed from userspace directly on some PCI host bridge. + panic + Always panic when IOMMU overflows + + +Miscellaneous +============= + + nogbpages + Do not use GB pages for kernel direct mappings. + gbpages + Use GB pages for kernel direct mappings. diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt deleted file mode 100644 index abc53886655e..000000000000 --- a/Documentation/x86/x86_64/boot-options.txt +++ /dev/null @@ -1,278 +0,0 @@ -AMD64 specific boot options - -There are many others (usually documented in driver documentation), but -only the AMD64 specific ones are listed here. - -Machine check - - Please see Documentation/x86/x86_64/machinecheck for sysfs runtime tunables. - - mce=off - Disable machine check - mce=no_cmci - Disable CMCI(Corrected Machine Check Interrupt) that - Intel processor supports. Usually this disablement is - not recommended, but it might be handy if your hardware - is misbehaving. - Note that you'll get more problems without CMCI than with - due to the shared banks, i.e. you might get duplicated - error logs. - mce=dont_log_ce - Don't make logs for corrected errors. All events reported - as corrected are silently cleared by OS. - This option will be useful if you have no interest in any - of corrected errors. - mce=ignore_ce - Disable features for corrected errors, e.g. polling timer - and CMCI. All events reported as corrected are not cleared - by OS and remained in its error banks. - Usually this disablement is not recommended, however if - there is an agent checking/clearing corrected errors - (e.g. BIOS or hardware monitoring applications), conflicting - with OS's error handling, and you cannot deactivate the agent, - then this option will be a help. - mce=no_lmce - Do not opt-in to Local MCE delivery. Use legacy method - to broadcast MCEs. - mce=bootlog - Enable logging of machine checks left over from booting. - Disabled by default on AMD Fam10h and older because some BIOS - leave bogus ones. - If your BIOS doesn't do that it's a good idea to enable though - to make sure you log even machine check events that result - in a reboot. On Intel systems it is enabled by default. - mce=nobootlog - Disable boot machine check logging. - mce=tolerancelevel[,monarchtimeout] (number,number) - tolerance levels: - 0: always panic on uncorrected errors, log corrected errors - 1: panic or SIGBUS on uncorrected errors, log corrected errors - 2: SIGBUS or log uncorrected errors, log corrected errors - 3: never panic or SIGBUS, log all errors (for testing only) - Default is 1 - Can be also set using sysfs which is preferable. - monarchtimeout: - Sets the time in us to wait for other CPUs on machine checks. 0 - to disable. - mce=bios_cmci_threshold - Don't overwrite the bios-set CMCI threshold. This boot option - prevents Linux from overwriting the CMCI threshold set by the - bios. Without this option, Linux always sets the CMCI - threshold to 1. Enabling this may make memory predictive failure - analysis less effective if the bios sets thresholds for memory - errors since we will not see details for all errors. - mce=recovery - Force-enable recoverable machine check code paths - - nomce (for compatibility with i386): same as mce=off - - Everything else is in sysfs now. - -APICs - - apic Use IO-APIC. Default - - noapic Don't use the IO-APIC. - - disableapic Don't use the local APIC - - nolapic Don't use the local APIC (alias for i386 compatibility) - - pirq=... See Documentation/x86/i386/IO-APIC.txt - - noapictimer Don't set up the APIC timer - - no_timer_check Don't check the IO-APIC timer. This can work around - problems with incorrect timer initialization on some boards. - apicpmtimer - Do APIC timer calibration using the pmtimer. Implies - apicmaintimer. Useful when your PIT timer is totally - broken. - -Timing - - notsc - Deprecated, use tsc=unstable instead. - - nohpet - Don't use the HPET timer. - -Idle loop - - idle=poll - Don't do power saving in the idle loop using HLT, but poll for rescheduling - event. This will make the CPUs eat a lot more power, but may be useful - to get slightly better performance in multiprocessor benchmarks. It also - makes some profiling using performance counters more accurate. - Please note that on systems with MONITOR/MWAIT support (like Intel EM64T - CPUs) this option has no performance advantage over the normal idle loop. - It may also interact badly with hyperthreading. - -Rebooting - - reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] [, [w]arm | [c]old] - bios Use the CPU reboot vector for warm reset - warm Don't set the cold reboot flag - cold Set the cold reboot flag - triple Force a triple fault (init) - kbd Use the keyboard controller. cold reset (default) - acpi Use the ACPI RESET_REG in the FADT. If ACPI is not configured or the - ACPI reset does not work, the reboot path attempts the reset using - the keyboard controller. - efi Use efi reset_system runtime service. If EFI is not configured or the - EFI reset does not work, the reboot path attempts the reset using - the keyboard controller. - - Using warm reset will be much faster especially on big memory - systems because the BIOS will not go through the memory check. - Disadvantage is that not all hardware will be completely reinitialized - on reboot so there may be boot problems on some systems. - - reboot=force - - Don't stop other CPUs on reboot. This can make reboot more reliable - in some cases. - -Non Executable Mappings - - noexec=on|off - - on Enable(default) - off Disable - -NUMA - - numa=off Only set up a single NUMA node spanning all memory. - - numa=noacpi Don't parse the SRAT table for NUMA setup - - numa=fake=[MG] - If given as a memory unit, fills all system RAM with nodes of - size interleaved over physical nodes. - - numa=fake= - If given as an integer, fills all system RAM with N fake nodes - interleaved over physical nodes. - - numa=fake=U - If given as an integer followed by 'U', it will divide each - physical node into N emulated nodes. - -ACPI - - acpi=off Don't enable ACPI - acpi=ht Use ACPI boot table parsing, but don't enable ACPI - interpreter - acpi=force Force ACPI on (currently not needed) - - acpi=strict Disable out of spec ACPI workarounds. - - acpi_sci={edge,level,high,low} Set up ACPI SCI interrupt. - - acpi=noirq Don't route interrupts - - acpi=nocmcff Disable firmware first mode for corrected errors. This - disables parsing the HEST CMC error source to check if - firmware has set the FF flag. This may result in - duplicate corrected error reports. - -PCI - - pci=off Don't use PCI - pci=conf1 Use conf1 access. - pci=conf2 Use conf2 access. - pci=rom Assign ROMs. - pci=assign-busses Assign busses - pci=irqmask=MASK Set PCI interrupt mask to MASK - pci=lastbus=NUMBER Scan up to NUMBER busses, no matter what the mptable says. - pci=noacpi Don't use ACPI to set up PCI interrupt routing. - -IOMMU (input/output memory management unit) - - Multiple x86-64 PCI-DMA mapping implementations exist, for example: - - 1. : use no hardware/software IOMMU at all - (e.g. because you have < 3 GB memory). - Kernel boot message: "PCI-DMA: Disabling IOMMU" - - 2. : AMD GART based hardware IOMMU. - Kernel boot message: "PCI-DMA: using GART IOMMU" - - 3. : Software IOMMU implementation. Used - e.g. if there is no hardware IOMMU in the system and it is need because - you have >3GB memory or told the kernel to us it (iommu=soft)) - Kernel boot message: "PCI-DMA: Using software bounce buffering - for IO (SWIOTLB)" - - 4. : IBM Calgary hardware IOMMU. Used in IBM - pSeries and xSeries servers. This hardware IOMMU supports DMA address - mapping with memory protection, etc. - Kernel boot message: "PCI-DMA: Using Calgary IOMMU" - - iommu=[][,noagp][,off][,force][,noforce] - [,memaper[=]][,merge][,fullflush][,nomerge] - [,noaperture][,calgary] - - General iommu options: - off Don't initialize and use any kind of IOMMU. - noforce Don't force hardware IOMMU usage when it is not needed. - (default). - force Force the use of the hardware IOMMU even when it is - not actually needed (e.g. because < 3 GB memory). - soft Use software bounce buffering (SWIOTLB) (default for - Intel machines). This can be used to prevent the usage - of an available hardware IOMMU. - - iommu options only relevant to the AMD GART hardware IOMMU: - Set the size of the remapping area in bytes. - allowed Overwrite iommu off workarounds for specific chipsets. - fullflush Flush IOMMU on each allocation (default). - nofullflush Don't use IOMMU fullflush. - memaper[=] Allocate an own aperture over RAM with size 32MB<[,force] - Prereserve that many 128K pages for the software IO - bounce buffering. - force Force all IO through the software TLB. - - Settings for the IBM Calgary hardware IOMMU currently found in IBM - pSeries and xSeries machines: - - calgary=[64k,128k,256k,512k,1M,2M,4M,8M] - calgary=[translate_empty_slots] - calgary=[disable=] - panic Always panic when IOMMU overflows - - 64k,...,8M - Set the size of each PCI slot's translation table - when using the Calgary IOMMU. This is the size of the translation - table itself in main memory. The smallest table, 64k, covers an IO - space of 32MB; the largest, 8MB table, can cover an IO space of - 4GB. Normally the kernel will make the right choice by itself. - - translate_empty_slots - Enable translation even on slots that have - no devices attached to them, in case a device will be hotplugged - in the future. - - disable= - Disable translation on a given PHB. For - example, the built-in graphics adapter resides on the first bridge - (PCI bus number 0); if translation (isolation) is enabled on this - bridge, X servers that access the hardware directly from user - space might stop working. Use this option if you have devices that - are accessed from userspace directly on some PCI host bridge. - -Miscellaneous - - nogbpages - Do not use GB pages for kernel direct mappings. - gbpages - Use GB pages for kernel direct mappings. diff --git a/Documentation/x86/x86_64/index.rst b/Documentation/x86/x86_64/index.rst new file mode 100644 index 000000000000..a8cf7713cac9 --- /dev/null +++ b/Documentation/x86/x86_64/index.rst @@ -0,0 +1,10 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============== +x86_64 Support +============== + +.. toctree:: + :maxdepth: 2 + + boot-options -- cgit v1.2.3 From 1c65b4e0f27f3688d95de59cbdfa076270251f3c Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:36 +0800 Subject: Documentation: x86: convert x86_64/uefi.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/x86_64/index.rst | 1 + Documentation/x86/x86_64/uefi.rst | 58 ++++++++++++++++++++++++++++++++++++++ Documentation/x86/x86_64/uefi.txt | 42 --------------------------- 3 files changed, 59 insertions(+), 42 deletions(-) create mode 100644 Documentation/x86/x86_64/uefi.rst delete mode 100644 Documentation/x86/x86_64/uefi.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/x86_64/index.rst b/Documentation/x86/x86_64/index.rst index a8cf7713cac9..ddfa1f9d4193 100644 --- a/Documentation/x86/x86_64/index.rst +++ b/Documentation/x86/x86_64/index.rst @@ -8,3 +8,4 @@ x86_64 Support :maxdepth: 2 boot-options + uefi diff --git a/Documentation/x86/x86_64/uefi.rst b/Documentation/x86/x86_64/uefi.rst new file mode 100644 index 000000000000..88c3ba32546f --- /dev/null +++ b/Documentation/x86/x86_64/uefi.rst @@ -0,0 +1,58 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================================== +General note on [U]EFI x86_64 support +===================================== + +The nomenclature EFI and UEFI are used interchangeably in this document. + +Although the tools below are _not_ needed for building the kernel, +the needed bootloader support and associated tools for x86_64 platforms +with EFI firmware and specifications are listed below. + +1. UEFI specification: http://www.uefi.org + +2. Booting Linux kernel on UEFI x86_64 platform requires bootloader + support. Elilo with x86_64 support can be used. + +3. x86_64 platform with EFI/UEFI firmware. + +Mechanics +--------- + +- Build the kernel with the following configuration:: + + CONFIG_FB_EFI=y + CONFIG_FRAMEBUFFER_CONSOLE=y + + If EFI runtime services are expected, the following configuration should + be selected:: + + CONFIG_EFI=y + CONFIG_EFI_VARS=y or m # optional + +- Create a VFAT partition on the disk +- Copy the following to the VFAT partition: + + elilo bootloader with x86_64 support, elilo configuration file, + kernel image built in first step and corresponding + initrd. Instructions on building elilo and its dependencies + can be found in the elilo sourceforge project. + +- Boot to EFI shell and invoke elilo choosing the kernel image built + in first step. +- If some or all EFI runtime services don't work, you can try following + kernel command line parameters to turn off some or all EFI runtime + services. + + noefi + turn off all EFI runtime services + reboot_type=k + turn off EFI reboot runtime service + +- If the EFI memory map has additional entries not in the E820 map, + you can include those entries in the kernels memory map of available + physical RAM by using the following kernel command line parameter. + + add_efi_memmap + include EFI memory map of available physical RAM diff --git a/Documentation/x86/x86_64/uefi.txt b/Documentation/x86/x86_64/uefi.txt deleted file mode 100644 index a5e2b4fdb170..000000000000 --- a/Documentation/x86/x86_64/uefi.txt +++ /dev/null @@ -1,42 +0,0 @@ -General note on [U]EFI x86_64 support -------------------------------------- - -The nomenclature EFI and UEFI are used interchangeably in this document. - -Although the tools below are _not_ needed for building the kernel, -the needed bootloader support and associated tools for x86_64 platforms -with EFI firmware and specifications are listed below. - -1. UEFI specification: http://www.uefi.org - -2. Booting Linux kernel on UEFI x86_64 platform requires bootloader - support. Elilo with x86_64 support can be used. - -3. x86_64 platform with EFI/UEFI firmware. - -Mechanics: ---------- -- Build the kernel with the following configuration. - CONFIG_FB_EFI=y - CONFIG_FRAMEBUFFER_CONSOLE=y - If EFI runtime services are expected, the following configuration should - be selected. - CONFIG_EFI=y - CONFIG_EFI_VARS=y or m # optional -- Create a VFAT partition on the disk -- Copy the following to the VFAT partition: - elilo bootloader with x86_64 support, elilo configuration file, - kernel image built in first step and corresponding - initrd. Instructions on building elilo and its dependencies - can be found in the elilo sourceforge project. -- Boot to EFI shell and invoke elilo choosing the kernel image built - in first step. -- If some or all EFI runtime services don't work, you can try following - kernel command line parameters to turn off some or all EFI runtime - services. - noefi turn off all EFI runtime services - reboot_type=k turn off EFI reboot runtime service -- If the EFI memory map has additional entries not in the E820 map, - you can include those entries in the kernels memory map of available - physical RAM by using the following kernel command line parameter. - add_efi_memmap include EFI memory map of available physical RAM -- cgit v1.2.3 From b88679d2f2b9e18618308bbe6d70a1fc91b6a35a Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:37 +0800 Subject: Documentation: x86: convert x86_64/mm.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/x86_64/index.rst | 1 + Documentation/x86/x86_64/mm.rst | 161 +++++++++++++++++++++++++++++++++++++ Documentation/x86/x86_64/mm.txt | 153 ----------------------------------- 3 files changed, 162 insertions(+), 153 deletions(-) create mode 100644 Documentation/x86/x86_64/mm.rst delete mode 100644 Documentation/x86/x86_64/mm.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/x86_64/index.rst b/Documentation/x86/x86_64/index.rst index ddfa1f9d4193..4b65d29ef459 100644 --- a/Documentation/x86/x86_64/index.rst +++ b/Documentation/x86/x86_64/index.rst @@ -9,3 +9,4 @@ x86_64 Support boot-options uefi + mm diff --git a/Documentation/x86/x86_64/mm.rst b/Documentation/x86/x86_64/mm.rst new file mode 100644 index 000000000000..52020577b8de --- /dev/null +++ b/Documentation/x86/x86_64/mm.rst @@ -0,0 +1,161 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================ +Memory Managment +================ + +Complete virtual memory map with 4-level page tables +==================================================== + +.. note:: + + - Negative addresses such as "-23 TB" are absolute addresses in bytes, counted down + from the top of the 64-bit address space. It's easier to understand the layout + when seen both in absolute addresses and in distance-from-top notation. + + For example 0xffffe90000000000 == -23 TB, it's 23 TB lower than the top of the + 64-bit address space (ffffffffffffffff). + + Note that as we get closer to the top of the address space, the notation changes + from TB to GB and then MB/KB. + + - "16M TB" might look weird at first sight, but it's an easier to visualize size + notation than "16 EB", which few will recognize at first sight as 16 exabytes. + It also shows it nicely how incredibly large 64-bit address space is. + +:: + + ======================================================================================================================== + Start addr | Offset | End addr | Size | VM area description + ======================================================================================================================== + | | | | + 0000000000000000 | 0 | 00007fffffffffff | 128 TB | user-space virtual memory, different per mm + __________________|____________|__________________|_________|___________________________________________________________ + | | | | + 0000800000000000 | +128 TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical + | | | | virtual memory addresses up to the -128 TB + | | | | starting offset of kernel mappings. + __________________|____________|__________________|_________|___________________________________________________________ + | + | Kernel-space virtual memory, shared between all processes: + ____________________________________________________________|___________________________________________________________ + | | | | + ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | ... guard hole, also reserved for hypervisor + ffff880000000000 | -120 TB | ffff887fffffffff | 0.5 TB | LDT remap for PTI + ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB | direct mapping of all physical memory (page_offset_base) + ffffc88000000000 | -55.5 TB | ffffc8ffffffffff | 0.5 TB | ... unused hole + ffffc90000000000 | -55 TB | ffffe8ffffffffff | 32 TB | vmalloc/ioremap space (vmalloc_base) + ffffe90000000000 | -23 TB | ffffe9ffffffffff | 1 TB | ... unused hole + ffffea0000000000 | -22 TB | ffffeaffffffffff | 1 TB | virtual memory map (vmemmap_base) + ffffeb0000000000 | -21 TB | ffffebffffffffff | 1 TB | ... unused hole + ffffec0000000000 | -20 TB | fffffbffffffffff | 16 TB | KASAN shadow memory + __________________|____________|__________________|_________|____________________________________________________________ + | + | Identical layout to the 56-bit one from here on: + ____________________________________________________________|____________________________________________________________ + | | | | + fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole + | | | | vaddr_end for KASLR + fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping + fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole + ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks + ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole + ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space + ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole + ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0 + ffffffff80000000 |-2048 MB | | | + ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space + ffffffffff000000 | -16 MB | | | + FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset + ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI + ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole + __________________|____________|__________________|_________|___________________________________________________________ + + +Complete virtual memory map with 5-level page tables +==================================================== + +.. note:: + + - With 56-bit addresses, user-space memory gets expanded by a factor of 512x, + from 0.125 PB to 64 PB. All kernel mappings shift down to the -64 PT starting + offset and many of the regions expand to support the much larger physical + memory supported. + +:: + + ======================================================================================================================== + Start addr | Offset | End addr | Size | VM area description + ======================================================================================================================== + | | | | + 0000000000000000 | 0 | 00ffffffffffffff | 64 PB | user-space virtual memory, different per mm + __________________|____________|__________________|_________|___________________________________________________________ + | | | | + 0000800000000000 | +64 PB | ffff7fffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical + | | | | virtual memory addresses up to the -64 PB + | | | | starting offset of kernel mappings. + __________________|____________|__________________|_________|___________________________________________________________ + | + | Kernel-space virtual memory, shared between all processes: + ____________________________________________________________|___________________________________________________________ + | | | | + ff00000000000000 | -64 PB | ff0fffffffffffff | 4 PB | ... guard hole, also reserved for hypervisor + ff10000000000000 | -60 PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI + ff11000000000000 | -59.75 PB | ff90ffffffffffff | 32 PB | direct mapping of all physical memory (page_offset_base) + ff91000000000000 | -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole + ffa0000000000000 | -24 PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base) + ffd2000000000000 | -11.5 PB | ffd3ffffffffffff | 0.5 PB | ... unused hole + ffd4000000000000 | -11 PB | ffd5ffffffffffff | 0.5 PB | virtual memory map (vmemmap_base) + ffd6000000000000 | -10.5 PB | ffdeffffffffffff | 2.25 PB | ... unused hole + ffdf000000000000 | -8.25 PB | fffffdffffffffff | ~8 PB | KASAN shadow memory + __________________|____________|__________________|_________|____________________________________________________________ + | + | Identical layout to the 47-bit one from here on: + ____________________________________________________________|____________________________________________________________ + | | | | + fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole + | | | | vaddr_end for KASLR + fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping + fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole + ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks + ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole + ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space + ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole + ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0 + ffffffff80000000 |-2048 MB | | | + ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space + ffffffffff000000 | -16 MB | | | + FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset + ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI + ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole + __________________|____________|__________________|_________|___________________________________________________________ + +Architecture defines a 64-bit virtual address. Implementations can support +less. Currently supported are 48- and 57-bit virtual addresses. Bits 63 +through to the most-significant implemented bit are sign extended. +This causes hole between user space and kernel addresses if you interpret them +as unsigned. + +The direct mapping covers all memory in the system up to the highest +memory address (this means in some cases it can also include PCI memory +holes). + +vmalloc space is lazily synchronized into the different PML4/PML5 pages of +the processes using the page fault handler, with init_top_pgt as +reference. + +We map EFI runtime services in the 'efi_pgd' PGD in a 64Gb large virtual +memory window (this size is arbitrary, it can be raised later if needed). +The mappings are not part of any other kernel PGD and are only available +during EFI runtime calls. + +Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all +physical memory, vmalloc/ioremap space and virtual memory map are randomized. +Their order is preserved but their base will be offset early at boot time. + +Be very careful vs. KASLR when changing anything here. The KASLR address +range must not overlap with anything except the KASAN shadow area, which is +correct as KASAN disables KASLR. + +For both 4- and 5-level layouts, the STACKLEAK_POISON value in the last 2MB +hole: ffffffffffff4111 diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt deleted file mode 100644 index 804f9426ed17..000000000000 --- a/Documentation/x86/x86_64/mm.txt +++ /dev/null @@ -1,153 +0,0 @@ -==================================================== -Complete virtual memory map with 4-level page tables -==================================================== - -Notes: - - - Negative addresses such as "-23 TB" are absolute addresses in bytes, counted down - from the top of the 64-bit address space. It's easier to understand the layout - when seen both in absolute addresses and in distance-from-top notation. - - For example 0xffffe90000000000 == -23 TB, it's 23 TB lower than the top of the - 64-bit address space (ffffffffffffffff). - - Note that as we get closer to the top of the address space, the notation changes - from TB to GB and then MB/KB. - - - "16M TB" might look weird at first sight, but it's an easier to visualize size - notation than "16 EB", which few will recognize at first sight as 16 exabytes. - It also shows it nicely how incredibly large 64-bit address space is. - -======================================================================================================================== - Start addr | Offset | End addr | Size | VM area description -======================================================================================================================== - | | | | - 0000000000000000 | 0 | 00007fffffffffff | 128 TB | user-space virtual memory, different per mm -__________________|____________|__________________|_________|___________________________________________________________ - | | | | - 0000800000000000 | +128 TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical - | | | | virtual memory addresses up to the -128 TB - | | | | starting offset of kernel mappings. -__________________|____________|__________________|_________|___________________________________________________________ - | - | Kernel-space virtual memory, shared between all processes: -____________________________________________________________|___________________________________________________________ - | | | | - ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | ... guard hole, also reserved for hypervisor - ffff880000000000 | -120 TB | ffff887fffffffff | 0.5 TB | LDT remap for PTI - ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB | direct mapping of all physical memory (page_offset_base) - ffffc88000000000 | -55.5 TB | ffffc8ffffffffff | 0.5 TB | ... unused hole - ffffc90000000000 | -55 TB | ffffe8ffffffffff | 32 TB | vmalloc/ioremap space (vmalloc_base) - ffffe90000000000 | -23 TB | ffffe9ffffffffff | 1 TB | ... unused hole - ffffea0000000000 | -22 TB | ffffeaffffffffff | 1 TB | virtual memory map (vmemmap_base) - ffffeb0000000000 | -21 TB | ffffebffffffffff | 1 TB | ... unused hole - ffffec0000000000 | -20 TB | fffffbffffffffff | 16 TB | KASAN shadow memory -__________________|____________|__________________|_________|____________________________________________________________ - | - | Identical layout to the 56-bit one from here on: -____________________________________________________________|____________________________________________________________ - | | | | - fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole - | | | | vaddr_end for KASLR - fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping - fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole - ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks - ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole - ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space - ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole - ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0 - ffffffff80000000 |-2048 MB | | | - ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space - ffffffffff000000 | -16 MB | | | - FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset - ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI - ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole -__________________|____________|__________________|_________|___________________________________________________________ - - -==================================================== -Complete virtual memory map with 5-level page tables -==================================================== - -Notes: - - - With 56-bit addresses, user-space memory gets expanded by a factor of 512x, - from 0.125 PB to 64 PB. All kernel mappings shift down to the -64 PT starting - offset and many of the regions expand to support the much larger physical - memory supported. - -======================================================================================================================== - Start addr | Offset | End addr | Size | VM area description -======================================================================================================================== - | | | | - 0000000000000000 | 0 | 00ffffffffffffff | 64 PB | user-space virtual memory, different per mm -__________________|____________|__________________|_________|___________________________________________________________ - | | | | - 0000800000000000 | +64 PB | ffff7fffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical - | | | | virtual memory addresses up to the -64 PB - | | | | starting offset of kernel mappings. -__________________|____________|__________________|_________|___________________________________________________________ - | - | Kernel-space virtual memory, shared between all processes: -____________________________________________________________|___________________________________________________________ - | | | | - ff00000000000000 | -64 PB | ff0fffffffffffff | 4 PB | ... guard hole, also reserved for hypervisor - ff10000000000000 | -60 PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI - ff11000000000000 | -59.75 PB | ff90ffffffffffff | 32 PB | direct mapping of all physical memory (page_offset_base) - ff91000000000000 | -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole - ffa0000000000000 | -24 PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base) - ffd2000000000000 | -11.5 PB | ffd3ffffffffffff | 0.5 PB | ... unused hole - ffd4000000000000 | -11 PB | ffd5ffffffffffff | 0.5 PB | virtual memory map (vmemmap_base) - ffd6000000000000 | -10.5 PB | ffdeffffffffffff | 2.25 PB | ... unused hole - ffdf000000000000 | -8.25 PB | fffffdffffffffff | ~8 PB | KASAN shadow memory -__________________|____________|__________________|_________|____________________________________________________________ - | - | Identical layout to the 47-bit one from here on: -____________________________________________________________|____________________________________________________________ - | | | | - fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole - | | | | vaddr_end for KASLR - fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping - fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole - ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks - ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole - ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space - ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole - ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0 - ffffffff80000000 |-2048 MB | | | - ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space - ffffffffff000000 | -16 MB | | | - FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset - ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI - ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole -__________________|____________|__________________|_________|___________________________________________________________ - -Architecture defines a 64-bit virtual address. Implementations can support -less. Currently supported are 48- and 57-bit virtual addresses. Bits 63 -through to the most-significant implemented bit are sign extended. -This causes hole between user space and kernel addresses if you interpret them -as unsigned. - -The direct mapping covers all memory in the system up to the highest -memory address (this means in some cases it can also include PCI memory -holes). - -vmalloc space is lazily synchronized into the different PML4/PML5 pages of -the processes using the page fault handler, with init_top_pgt as -reference. - -We map EFI runtime services in the 'efi_pgd' PGD in a 64Gb large virtual -memory window (this size is arbitrary, it can be raised later if needed). -The mappings are not part of any other kernel PGD and are only available -during EFI runtime calls. - -Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all -physical memory, vmalloc/ioremap space and virtual memory map are randomized. -Their order is preserved but their base will be offset early at boot time. - -Be very careful vs. KASLR when changing anything here. The KASLR address -range must not overlap with anything except the KASAN shadow area, which is -correct as KASAN disables KASLR. - -For both 4- and 5-level layouts, the STACKLEAK_POISON value in the last 2MB -hole: ffffffffffff4111 -- cgit v1.2.3 From 85a3bd41cd680aacf0b88e78bbfcfb5bc13292ff Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:38 +0800 Subject: Documentation: x86: convert x86_64/5level-paging.txt to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/x86_64/5level-paging.rst | 67 ++++++++++++++++++++++++++++++ Documentation/x86/x86_64/5level-paging.txt | 61 --------------------------- Documentation/x86/x86_64/index.rst | 1 + 3 files changed, 68 insertions(+), 61 deletions(-) create mode 100644 Documentation/x86/x86_64/5level-paging.rst delete mode 100644 Documentation/x86/x86_64/5level-paging.txt (limited to 'Documentation/x86') diff --git a/Documentation/x86/x86_64/5level-paging.rst b/Documentation/x86/x86_64/5level-paging.rst new file mode 100644 index 000000000000..ab88a4514163 --- /dev/null +++ b/Documentation/x86/x86_64/5level-paging.rst @@ -0,0 +1,67 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============== +5-level paging +============== + +Overview +======== +Original x86-64 was limited by 4-level paing to 256 TiB of virtual address +space and 64 TiB of physical address space. We are already bumping into +this limit: some vendors offers servers with 64 TiB of memory today. + +To overcome the limitation upcoming hardware will introduce support for +5-level paging. It is a straight-forward extension of the current page +table structure adding one more layer of translation. + +It bumps the limits to 128 PiB of virtual address space and 4 PiB of +physical address space. This "ought to be enough for anybody" ©. + +QEMU 2.9 and later support 5-level paging. + +Virtual memory layout for 5-level paging is described in +Documentation/x86/x86_64/mm.txt + + +Enabling 5-level paging +======================= +CONFIG_X86_5LEVEL=y enables the feature. + +Kernel with CONFIG_X86_5LEVEL=y still able to boot on 4-level hardware. +In this case additional page table level -- p4d -- will be folded at +runtime. + +User-space and large virtual address space +========================================== +On x86, 5-level paging enables 56-bit userspace virtual address space. +Not all user space is ready to handle wide addresses. It's known that +at least some JIT compilers use higher bits in pointers to encode their +information. It collides with valid pointers with 5-level paging and +leads to crashes. + +To mitigate this, we are not going to allocate virtual address space +above 47-bit by default. + +But userspace can ask for allocation from full address space by +specifying hint address (with or without MAP_FIXED) above 47-bits. + +If hint address set above 47-bit, but MAP_FIXED is not specified, we try +to look for unmapped area by specified address. If it's already +occupied, we look for unmapped area in *full* address space, rather than +from 47-bit window. + +A high hint address would only affect the allocation in question, but not +any future mmap()s. + +Specifying high hint address on older kernel or on machine without 5-level +paging support is safe. The hint will be ignored and kernel will fall back +to allocation from 47-bit address space. + +This approach helps to easily make application's memory allocator aware +about large address space without manually tracking allocated virtual +address space. + +One important case we need to handle here is interaction with MPX. +MPX (without MAWA extension) cannot handle addresses above 47-bit, so we +need to make sure that MPX cannot be enabled we already have VMA above +the boundary and forbid creating such VMAs once MPX is enabled. diff --git a/Documentation/x86/x86_64/5level-paging.txt b/Documentation/x86/x86_64/5level-paging.txt deleted file mode 100644 index 2432a5ef86d9..000000000000 --- a/Documentation/x86/x86_64/5level-paging.txt +++ /dev/null @@ -1,61 +0,0 @@ -== Overview == - -Original x86-64 was limited by 4-level paing to 256 TiB of virtual address -space and 64 TiB of physical address space. We are already bumping into -this limit: some vendors offers servers with 64 TiB of memory today. - -To overcome the limitation upcoming hardware will introduce support for -5-level paging. It is a straight-forward extension of the current page -table structure adding one more layer of translation. - -It bumps the limits to 128 PiB of virtual address space and 4 PiB of -physical address space. This "ought to be enough for anybody" ©. - -QEMU 2.9 and later support 5-level paging. - -Virtual memory layout for 5-level paging is described in -Documentation/x86/x86_64/mm.txt - -== Enabling 5-level paging == - -CONFIG_X86_5LEVEL=y enables the feature. - -Kernel with CONFIG_X86_5LEVEL=y still able to boot on 4-level hardware. -In this case additional page table level -- p4d -- will be folded at -runtime. - -== User-space and large virtual address space == - -On x86, 5-level paging enables 56-bit userspace virtual address space. -Not all user space is ready to handle wide addresses. It's known that -at least some JIT compilers use higher bits in pointers to encode their -information. It collides with valid pointers with 5-level paging and -leads to crashes. - -To mitigate this, we are not going to allocate virtual address space -above 47-bit by default. - -But userspace can ask for allocation from full address space by -specifying hint address (with or without MAP_FIXED) above 47-bits. - -If hint address set above 47-bit, but MAP_FIXED is not specified, we try -to look for unmapped area by specified address. If it's already -occupied, we look for unmapped area in *full* address space, rather than -from 47-bit window. - -A high hint address would only affect the allocation in question, but not -any future mmap()s. - -Specifying high hint address on older kernel or on machine without 5-level -paging support is safe. The hint will be ignored and kernel will fall back -to allocation from 47-bit address space. - -This approach helps to easily make application's memory allocator aware -about large address space without manually tracking allocated virtual -address space. - -One important case we need to handle here is interaction with MPX. -MPX (without MAWA extension) cannot handle addresses above 47-bit, so we -need to make sure that MPX cannot be enabled we already have VMA above -the boundary and forbid creating such VMAs once MPX is enabled. - diff --git a/Documentation/x86/x86_64/index.rst b/Documentation/x86/x86_64/index.rst index 4b65d29ef459..7b8c82151358 100644 --- a/Documentation/x86/x86_64/index.rst +++ b/Documentation/x86/x86_64/index.rst @@ -10,3 +10,4 @@ x86_64 Support boot-options uefi mm + 5level-paging -- cgit v1.2.3 From f0339db77665e9a196c3afd55386becfb9e70d51 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:39 +0800 Subject: Documentation: x86: convert x86_64/fake-numa-for-cpusets to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/x86_64/fake-numa-for-cpusets | 67 ------------------- Documentation/x86/x86_64/fake-numa-for-cpusets.rst | 78 ++++++++++++++++++++++ Documentation/x86/x86_64/index.rst | 1 + 3 files changed, 79 insertions(+), 67 deletions(-) delete mode 100644 Documentation/x86/x86_64/fake-numa-for-cpusets create mode 100644 Documentation/x86/x86_64/fake-numa-for-cpusets.rst (limited to 'Documentation/x86') diff --git a/Documentation/x86/x86_64/fake-numa-for-cpusets b/Documentation/x86/x86_64/fake-numa-for-cpusets deleted file mode 100644 index 4b09f18831f8..000000000000 --- a/Documentation/x86/x86_64/fake-numa-for-cpusets +++ /dev/null @@ -1,67 +0,0 @@ -Using numa=fake and CPUSets for Resource Management -Written by David Rientjes - -This document describes how the numa=fake x86_64 command-line option can be used -in conjunction with cpusets for coarse memory management. Using this feature, -you can create fake NUMA nodes that represent contiguous chunks of memory and -assign them to cpusets and their attached tasks. This is a way of limiting the -amount of system memory that are available to a certain class of tasks. - -For more information on the features of cpusets, see -Documentation/cgroup-v1/cpusets.txt. -There are a number of different configurations you can use for your needs. For -more information on the numa=fake command line option and its various ways of -configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt. - -For the purposes of this introduction, we'll assume a very primitive NUMA -emulation setup of "numa=fake=4*512,". This will split our system memory into -four equal chunks of 512M each that we can now use to assign to cpusets. As -you become more familiar with using this combination for resource control, -you'll determine a better setup to minimize the number of nodes you have to deal -with. - -A machine may be split as follows with "numa=fake=4*512," as reported by dmesg: - - Faking node 0 at 0000000000000000-0000000020000000 (512MB) - Faking node 1 at 0000000020000000-0000000040000000 (512MB) - Faking node 2 at 0000000040000000-0000000060000000 (512MB) - Faking node 3 at 0000000060000000-0000000080000000 (512MB) - ... - On node 0 totalpages: 130975 - On node 1 totalpages: 131072 - On node 2 totalpages: 131072 - On node 3 totalpages: 131072 - -Now following the instructions for mounting the cpusets filesystem from -Documentation/cgroup-v1/cpusets.txt, you can assign fake nodes (i.e. contiguous memory -address spaces) to individual cpusets: - - [root@xroads /]# mkdir exampleset - [root@xroads /]# mount -t cpuset none exampleset - [root@xroads /]# mkdir exampleset/ddset - [root@xroads /]# cd exampleset/ddset - [root@xroads /exampleset/ddset]# echo 0-1 > cpus - [root@xroads /exampleset/ddset]# echo 0-1 > mems - -Now this cpuset, 'ddset', will only allowed access to fake nodes 0 and 1 for -memory allocations (1G). - -You can now assign tasks to these cpusets to limit the memory resources -available to them according to the fake nodes assigned as mems: - - [root@xroads /exampleset/ddset]# echo $$ > tasks - [root@xroads /exampleset/ddset]# dd if=/dev/zero of=tmp bs=1024 count=1G - [1] 13425 - -Notice the difference between the system memory usage as reported by -/proc/meminfo between the restricted cpuset case above and the unrestricted -case (i.e. running the same 'dd' command without assigning it to a fake NUMA -cpuset): - Unrestricted Restricted - MemTotal: 3091900 kB 3091900 kB - MemFree: 42113 kB 1513236 kB - -This allows for coarse memory management for the tasks you assign to particular -cpusets. Since cpusets can form a hierarchy, you can create some pretty -interesting combinations of use-cases for various classes of tasks for your -memory management needs. diff --git a/Documentation/x86/x86_64/fake-numa-for-cpusets.rst b/Documentation/x86/x86_64/fake-numa-for-cpusets.rst new file mode 100644 index 000000000000..74fbb78b3c67 --- /dev/null +++ b/Documentation/x86/x86_64/fake-numa-for-cpusets.rst @@ -0,0 +1,78 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +Fake NUMA For CPUSets +===================== + +:Author: David Rientjes + +Using numa=fake and CPUSets for Resource Management + +This document describes how the numa=fake x86_64 command-line option can be used +in conjunction with cpusets for coarse memory management. Using this feature, +you can create fake NUMA nodes that represent contiguous chunks of memory and +assign them to cpusets and their attached tasks. This is a way of limiting the +amount of system memory that are available to a certain class of tasks. + +For more information on the features of cpusets, see +Documentation/cgroup-v1/cpusets.txt. +There are a number of different configurations you can use for your needs. For +more information on the numa=fake command line option and its various ways of +configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt. + +For the purposes of this introduction, we'll assume a very primitive NUMA +emulation setup of "numa=fake=4*512,". This will split our system memory into +four equal chunks of 512M each that we can now use to assign to cpusets. As +you become more familiar with using this combination for resource control, +you'll determine a better setup to minimize the number of nodes you have to deal +with. + +A machine may be split as follows with "numa=fake=4*512," as reported by dmesg:: + + Faking node 0 at 0000000000000000-0000000020000000 (512MB) + Faking node 1 at 0000000020000000-0000000040000000 (512MB) + Faking node 2 at 0000000040000000-0000000060000000 (512MB) + Faking node 3 at 0000000060000000-0000000080000000 (512MB) + ... + On node 0 totalpages: 130975 + On node 1 totalpages: 131072 + On node 2 totalpages: 131072 + On node 3 totalpages: 131072 + +Now following the instructions for mounting the cpusets filesystem from +Documentation/cgroup-v1/cpusets.txt, you can assign fake nodes (i.e. contiguous memory +address spaces) to individual cpusets:: + + [root@xroads /]# mkdir exampleset + [root@xroads /]# mount -t cpuset none exampleset + [root@xroads /]# mkdir exampleset/ddset + [root@xroads /]# cd exampleset/ddset + [root@xroads /exampleset/ddset]# echo 0-1 > cpus + [root@xroads /exampleset/ddset]# echo 0-1 > mems + +Now this cpuset, 'ddset', will only allowed access to fake nodes 0 and 1 for +memory allocations (1G). + +You can now assign tasks to these cpusets to limit the memory resources +available to them according to the fake nodes assigned as mems:: + + [root@xroads /exampleset/ddset]# echo $$ > tasks + [root@xroads /exampleset/ddset]# dd if=/dev/zero of=tmp bs=1024 count=1G + [1] 13425 + +Notice the difference between the system memory usage as reported by +/proc/meminfo between the restricted cpuset case above and the unrestricted +case (i.e. running the same 'dd' command without assigning it to a fake NUMA +cpuset): + + ======== ============ ========== + Name Unrestricted Restricted + ======== ============ ========== + MemTotal 3091900 kB 3091900 kB + MemFree 42113 kB 1513236 kB + ======== ============ ========== + +This allows for coarse memory management for the tasks you assign to particular +cpusets. Since cpusets can form a hierarchy, you can create some pretty +interesting combinations of use-cases for various classes of tasks for your +memory management needs. diff --git a/Documentation/x86/x86_64/index.rst b/Documentation/x86/x86_64/index.rst index 7b8c82151358..e2a324cde671 100644 --- a/Documentation/x86/x86_64/index.rst +++ b/Documentation/x86/x86_64/index.rst @@ -11,3 +11,4 @@ x86_64 Support uefi mm 5level-paging + fake-numa-for-cpusets -- cgit v1.2.3 From bdde117ffed2625389787f82fc92724b486ce976 Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:40 +0800 Subject: Documentation: x86: convert x86_64/cpu-hotplug-spec to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/x86_64/cpu-hotplug-spec | 21 --------------------- Documentation/x86/x86_64/cpu-hotplug-spec.rst | 24 ++++++++++++++++++++++++ Documentation/x86/x86_64/index.rst | 1 + 3 files changed, 25 insertions(+), 21 deletions(-) delete mode 100644 Documentation/x86/x86_64/cpu-hotplug-spec create mode 100644 Documentation/x86/x86_64/cpu-hotplug-spec.rst (limited to 'Documentation/x86') diff --git a/Documentation/x86/x86_64/cpu-hotplug-spec b/Documentation/x86/x86_64/cpu-hotplug-spec deleted file mode 100644 index 3c23e0587db3..000000000000 --- a/Documentation/x86/x86_64/cpu-hotplug-spec +++ /dev/null @@ -1,21 +0,0 @@ -Firmware support for CPU hotplug under Linux/x86-64 ---------------------------------------------------- - -Linux/x86-64 supports CPU hotplug now. For various reasons Linux wants to -know in advance of boot time the maximum number of CPUs that could be plugged -into the system. ACPI 3.0 currently has no official way to supply -this information from the firmware to the operating system. - -In ACPI each CPU needs an LAPIC object in the MADT table (5.2.11.5 in the -ACPI 3.0 specification). ACPI already has the concept of disabled LAPIC -objects by setting the Enabled bit in the LAPIC object to zero. - -For CPU hotplug Linux/x86-64 expects now that any possible future hotpluggable -CPU is already available in the MADT. If the CPU is not available yet -it should have its LAPIC Enabled bit set to 0. Linux will use the number -of disabled LAPICs to compute the maximum number of future CPUs. - -In the worst case the user can overwrite this choice using a command line -option (additional_cpus=...), but it is recommended to supply the correct -number (or a reasonable approximation of it, with erring towards more not less) -in the MADT to avoid manual configuration. diff --git a/Documentation/x86/x86_64/cpu-hotplug-spec.rst b/Documentation/x86/x86_64/cpu-hotplug-spec.rst new file mode 100644 index 000000000000..8d1c91f0c880 --- /dev/null +++ b/Documentation/x86/x86_64/cpu-hotplug-spec.rst @@ -0,0 +1,24 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================================== +Firmware support for CPU hotplug under Linux/x86-64 +=================================================== + +Linux/x86-64 supports CPU hotplug now. For various reasons Linux wants to +know in advance of boot time the maximum number of CPUs that could be plugged +into the system. ACPI 3.0 currently has no official way to supply +this information from the firmware to the operating system. + +In ACPI each CPU needs an LAPIC object in the MADT table (5.2.11.5 in the +ACPI 3.0 specification). ACPI already has the concept of disabled LAPIC +objects by setting the Enabled bit in the LAPIC object to zero. + +For CPU hotplug Linux/x86-64 expects now that any possible future hotpluggable +CPU is already available in the MADT. If the CPU is not available yet +it should have its LAPIC Enabled bit set to 0. Linux will use the number +of disabled LAPICs to compute the maximum number of future CPUs. + +In the worst case the user can overwrite this choice using a command line +option (additional_cpus=...), but it is recommended to supply the correct +number (or a reasonable approximation of it, with erring towards more not less) +in the MADT to avoid manual configuration. diff --git a/Documentation/x86/x86_64/index.rst b/Documentation/x86/x86_64/index.rst index e2a324cde671..c04b6eab3c76 100644 --- a/Documentation/x86/x86_64/index.rst +++ b/Documentation/x86/x86_64/index.rst @@ -12,3 +12,4 @@ x86_64 Support mm 5level-paging fake-numa-for-cpusets + cpu-hotplug-spec -- cgit v1.2.3 From e115fb4bd2667754135d0436b6419fb171e9ea4a Mon Sep 17 00:00:00 2001 From: Changbin Du Date: Wed, 8 May 2019 23:21:41 +0800 Subject: Documentation: x86: convert x86_64/machinecheck to reST This converts the plain text documentation to reStructuredText format and add it to Sphinx TOC tree. No essential content change. Signed-off-by: Changbin Du Reviewed-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/x86/x86_64/index.rst | 1 + Documentation/x86/x86_64/machinecheck | 83 ------------------------------ Documentation/x86/x86_64/machinecheck.rst | 85 +++++++++++++++++++++++++++++++ 3 files changed, 86 insertions(+), 83 deletions(-) delete mode 100644 Documentation/x86/x86_64/machinecheck create mode 100644 Documentation/x86/x86_64/machinecheck.rst (limited to 'Documentation/x86') diff --git a/Documentation/x86/x86_64/index.rst b/Documentation/x86/x86_64/index.rst index c04b6eab3c76..d6eaaa5a35fc 100644 --- a/Documentation/x86/x86_64/index.rst +++ b/Documentation/x86/x86_64/index.rst @@ -13,3 +13,4 @@ x86_64 Support 5level-paging fake-numa-for-cpusets cpu-hotplug-spec + machinecheck diff --git a/Documentation/x86/x86_64/machinecheck b/Documentation/x86/x86_64/machinecheck deleted file mode 100644 index d0648a74fceb..000000000000 --- a/Documentation/x86/x86_64/machinecheck +++ /dev/null @@ -1,83 +0,0 @@ - -Configurable sysfs parameters for the x86-64 machine check code. - -Machine checks report internal hardware error conditions detected -by the CPU. Uncorrected errors typically cause a machine check -(often with panic), corrected ones cause a machine check log entry. - -Machine checks are organized in banks (normally associated with -a hardware subsystem) and subevents in a bank. The exact meaning -of the banks and subevent is CPU specific. - -mcelog knows how to decode them. - -When you see the "Machine check errors logged" message in the system -log then mcelog should run to collect and decode machine check entries -from /dev/mcelog. Normally mcelog should be run regularly from a cronjob. - -Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN -(N = CPU number) - -The directory contains some configurable entries: - -Entries: - -bankNctl -(N bank number) - 64bit Hex bitmask enabling/disabling specific subevents for bank N - When a bit in the bitmask is zero then the respective - subevent will not be reported. - By default all events are enabled. - Note that BIOS maintain another mask to disable specific events - per bank. This is not visible here - -The following entries appear for each CPU, but they are truly shared -between all CPUs. - -check_interval - How often to poll for corrected machine check errors, in seconds - (Note output is hexadecimal). Default 5 minutes. When the poller - finds MCEs it triggers an exponential speedup (poll more often) on - the polling interval. When the poller stops finding MCEs, it - triggers an exponential backoff (poll less often) on the polling - interval. The check_interval variable is both the initial and - maximum polling interval. 0 means no polling for corrected machine - check errors (but some corrected errors might be still reported - in other ways) - -tolerant - Tolerance level. When a machine check exception occurs for a non - corrected machine check the kernel can take different actions. - Since machine check exceptions can happen any time it is sometimes - risky for the kernel to kill a process because it defies - normal kernel locking rules. The tolerance level configures - how hard the kernel tries to recover even at some risk of - deadlock. Higher tolerant values trade potentially better uptime - with the risk of a crash or even corruption (for tolerant >= 3). - - 0: always panic on uncorrected errors, log corrected errors - 1: panic or SIGBUS on uncorrected errors, log corrected errors - 2: SIGBUS or log uncorrected errors, log corrected errors - 3: never panic or SIGBUS, log all errors (for testing only) - - Default: 1 - - Note this only makes a difference if the CPU allows recovery - from a machine check exception. Current x86 CPUs generally do not. - -trigger - Program to run when a machine check event is detected. - This is an alternative to running mcelog regularly from cron - and allows to detect events faster. -monarch_timeout - How long to wait for the other CPUs to machine check too on a - exception. 0 to disable waiting for other CPUs. - Unit: us - -TBD document entries for AMD threshold interrupt configuration - -For more details about the x86 machine check architecture -see the Intel and AMD architecture manuals from their developer websites. - -For more details about the architecture see -see http://one.firstfloor.org/~andi/mce.pdf diff --git a/Documentation/x86/x86_64/machinecheck.rst b/Documentation/x86/x86_64/machinecheck.rst new file mode 100644 index 000000000000..e189168406fa --- /dev/null +++ b/Documentation/x86/x86_64/machinecheck.rst @@ -0,0 +1,85 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================================================== +Configurable sysfs parameters for the x86-64 machine check code +=============================================================== + +Machine checks report internal hardware error conditions detected +by the CPU. Uncorrected errors typically cause a machine check +(often with panic), corrected ones cause a machine check log entry. + +Machine checks are organized in banks (normally associated with +a hardware subsystem) and subevents in a bank. The exact meaning +of the banks and subevent is CPU specific. + +mcelog knows how to decode them. + +When you see the "Machine check errors logged" message in the system +log then mcelog should run to collect and decode machine check entries +from /dev/mcelog. Normally mcelog should be run regularly from a cronjob. + +Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN +(N = CPU number). + +The directory contains some configurable entries: + +bankNctl + (N bank number) + + 64bit Hex bitmask enabling/disabling specific subevents for bank N + When a bit in the bitmask is zero then the respective + subevent will not be reported. + By default all events are enabled. + Note that BIOS maintain another mask to disable specific events + per bank. This is not visible here + +The following entries appear for each CPU, but they are truly shared +between all CPUs. + +check_interval + How often to poll for corrected machine check errors, in seconds + (Note output is hexadecimal). Default 5 minutes. When the poller + finds MCEs it triggers an exponential speedup (poll more often) on + the polling interval. When the poller stops finding MCEs, it + triggers an exponential backoff (poll less often) on the polling + interval. The check_interval variable is both the initial and + maximum polling interval. 0 means no polling for corrected machine + check errors (but some corrected errors might be still reported + in other ways) + +tolerant + Tolerance level. When a machine check exception occurs for a non + corrected machine check the kernel can take different actions. + Since machine check exceptions can happen any time it is sometimes + risky for the kernel to kill a process because it defies + normal kernel locking rules. The tolerance level configures + how hard the kernel tries to recover even at some risk of + deadlock. Higher tolerant values trade potentially better uptime + with the risk of a crash or even corruption (for tolerant >= 3). + + 0: always panic on uncorrected errors, log corrected errors + 1: panic or SIGBUS on uncorrected errors, log corrected errors + 2: SIGBUS or log uncorrected errors, log corrected errors + 3: never panic or SIGBUS, log all errors (for testing only) + + Default: 1 + + Note this only makes a difference if the CPU allows recovery + from a machine check exception. Current x86 CPUs generally do not. + +trigger + Program to run when a machine check event is detected. + This is an alternative to running mcelog regularly from cron + and allows to detect events faster. +monarch_timeout + How long to wait for the other CPUs to machine check too on a + exception. 0 to disable waiting for other CPUs. + Unit: us + +TBD document entries for AMD threshold interrupt configuration + +For more details about the x86 machine check architecture +see the Intel and AMD architecture manuals from their developer websites. + +For more details about the architecture see +see http://one.firstfloor.org/~andi/mce.pdf -- cgit v1.2.3