Linux Expert Syscalls
Expert-Level Linux Syscalls & Kernel Interfaces
A comprehensive reference for senior performance engineers: syscalls, kernel interfaces, and mechanisms that go beyond the basics. Each entry covers what it does, why it matters, who uses it, and the sharp edges.
CPU Performance & Execution
- sched_yield() vs spin-wait vs futex(FUTEX_WAIT): The tradeoff triangle. sched_yield() is almost always wrong: it yields to any runnable thread, not the one holding your lock. Spinning with the x86 pause instruction (_mm_pause()) is correct for waits under about 1μs on dedicated cores. futex is correct for everything else. Senior engineers know that sched_yield() in a spinlock is an anti-pattern that invites priority inversion and cache thrashing.
- getcpu() via vDSO: Not a real syscall. The kernel maps the vDSO into every process, and its getcpu() reads the current CPU and NUMA node without entering the kernel (on x86 via RDPID, RDTSCP, or a per-CPU segment limit). It costs a few nanoseconds versus roughly 100ns for a real syscall. Used to index per-CPU data structures. On x86, RDTSCP also returns the contents of IA32_TSC_AUX in ECX, which Linux loads with the CPU and node number, if you want even lower overhead.
- arch_prctl(ARCH_SET_CPUID) (x86): Controls whether the CPUID instruction faults. With the feature disabled (argument 0), executing CPUID raises SIGSEGV in the calling thread, on CPUs that support CPUID faulting. Record-and-replay tools and emulators use it to intercept and rewrite CPUID results without full hardware virtualization. It is not a way to detect thread migration; rseq is the mechanism for that.
- prctl(PR_SET_SPECULATION_CTRL): Per-task control over speculation mitigations (PR_SPEC_STORE_BYPASS, PR_SPEC_INDIRECT_BRANCH). PR_SPEC_DISABLE turns the speculation feature off, which means the mitigation is on (safe, but with a workload-dependent cost often quoted as 5-15%). PR_SPEC_FORCE_DISABLE does the same and cannot be undone; the setting is inherited across fork. Where the system mitigation mode leaves the choice to prctl, HFT shops sometimes use PR_SPEC_ENABLE to reclaim that overhead on isolated cores, accepting the Spectre risk on a machine that runs no untrusted code.
- prctl(PR_TASK_PERF_EVENTS_DISABLE / PR_TASK_PERF_EVENTS_ENABLE): Pause and resume all perf_event_open counters created by the calling task. Used for precise measurement windows: disable counters, run setup code, enable counters, run the hot path, disable counters. Eliminates noise from initialization.
- UMCG (User-Managed Concurrency Groups): A proposed patch series, not merged into mainline. Cooperative scheduling between userspace and kernel: a userspace scheduler (Go runtime, Java virtual threads) can context-switch between tasks without entering the kernel, but still gets notified when a task blocks on I/O. Google's internal fiber library drove this. It would let you implement M:N threading with proper kernel integration rather than the workarounds Go currently uses.
- sched_setattr() with SCHED_FLAG_UTIL_CLAMP: Clamp the utilization value (0-1024) that the scheduler uses for frequency selection and task placement. util_min says "treat this thread as at least this busy" (avoids clock ramp-up latency on the fast path). util_max says "never treat it as busier than this" (power cap). Android uses this extensively; for example, a UI thread might get util_min=512 while background sync gets util_max=256.
- membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ): Force all threads in the process to restart their rseq critical sections. If you update a per-CPU data structure from a remote thread, you need this to ensure no reader is mid-critical-section with stale data. The process must first register with MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ. The cost is an IPI to each core running the process: expensive, but cheaper than taking a lock on the read side.
- perf_event_open() with PERF_COUNT_HW_REF_CPU_CYCLES vs PERF_COUNT_HW_CPU_CYCLES: Reference cycles count at a fixed rate regardless of frequency scaling. CPU cycles count at the actual core frequency. The ratio tells you your effective clock speed relative to the reference (base) frequency. If cycles/ref_cycles is 0.6, your code is running at 60% of the base frequency, probably throttled by power or thermal limits; a ratio above 1.0 means turbo. This is how you diagnose "the code is correct but slow in production" when the answer is CPU throttling.
- /sys/devices/cpu/rdpmc + mmap of a perf counter: After perf_event_open, you can mmap the event and read the counter from userspace with the RDPMC instruction. That costs tens of cycles versus on the order of a thousand for a read() syscall. Nanobenchmark frameworks use this for per-function IPC, cache miss, and branch misprediction counting.
- sched_setattr() with SCHED_DEADLINE: Earliest Deadline First scheduling. You declare runtime, deadline, and period; the kernel runs admission control and then guarantees that CPU time. Used in real-time audio, robotics, and low-latency trading.
- perf_event_open(): Direct access to hardware performance counters (cache misses, branch mispredictions, TLB misses, IPC). The underlying syscall behind perf. You can set up ring buffers for sampling with zero syscall overhead in the steady state.
- clone3(): The modern version of clone() with extensible struct-based args. Supports CLONE_INTO_CGROUP (spawn directly into a cgroup) and CLONE_PIDFD (get a pidfd atomically at creation).
- sched_setaffinity() + getcpu(): Pin threads to cores. getcpu() tells you which core you're on without a syscall (uses the vDSO). Critical for NUMA-aware data structures and per-core sharding.
- futex() (specifically FUTEX_WAIT_BITSET, FUTEX_LOCK_PI): Priority-inheritance futexes prevent priority inversion. Bitset waiters allow selective wakeup of specific threads. The foundation of every serious userspace synchronization primitive.
- rseq() (restartable sequences): Register a per-thread area so that short critical sections are restarted by the kernel if the thread is preempted, migrated, or interrupted by a signal mid-sequence. Enables per-CPU data structures without locks or atomics. glibc registers rseq automatically since 2.35 and uses it to speed up sched_getcpu(); tcmalloc uses it for per-CPU caches.
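To make the pinning entry concrete, here is a minimal sketch of pinning the caller to one core and restoring the old mask afterwards. The helper name `pin_to_cpu` is an assumption for illustration; only `os.sched_getaffinity` and `os.sched_setaffinity` come from the standard library.

```python
import os


def pin_to_cpu(cpu: int) -> set:
    """Pin the caller to a single CPU and return the previous affinity mask.

    `pin_to_cpu` is a hypothetical helper name; pid 0 means "the caller".
    """
    previous = os.sched_getaffinity(0)
    if cpu not in previous:
        # Cgroup cpusets or an earlier pin may already exclude this CPU.
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(previous)}")
    os.sched_setaffinity(0, {cpu})
    return previous


if __name__ == "__main__":
    old_mask = pin_to_cpu(min(os.sched_getaffinity(0)))
    print("now restricted to", os.sched_getaffinity(0))
    os.sched_setaffinity(0, old_mask)  # restore the original mask
```

Returning the previous mask keeps the change reversible, which matters when only a short measurement window should be pinned.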
Memory — Allocators, Huge Pages & NUMA
- madvise(MADV_HUGEPAGE) on specific VMAs vs system-wide THP: The expert move is to switch THP to opt-in mode (write madvise to /sys/kernel/mm/transparent_hugepage/enabled) and then apply MADV_HUGEPAGE only to your known-hot regions (buffer pool, JIT code cache, hash tables). This gets huge page benefits without khugepaged and compaction activity across the rest of your heap. Several databases document this setup.
- MAP_HUGETLB | MAP_HUGE_1GB: Explicit 1GB huge pages. Drastically reduces TLB misses for large memory regions: a 256GB buffer pool is covered by 256 translations instead of 67 million 4KB ones. In practice this requires boot-time reservation (hugepagesz=1G hugepages=64), because the kernel can rarely assemble a free, contiguous 1GB range once memory has fragmented. Database vendors (Oracle, SAP HANA) call for huge pages in their production tuning guides.
- mbind(MPOL_INTERLEAVE): Interleave allocations round-robin across NUMA nodes. Counter-intuitive: for shared data structures accessed by all cores, interleaving can be faster than local allocation because it distributes memory bandwidth across all memory controllers. numactl --interleave=all applies the same policy to a whole process.
- move_pages() as a status query (nodes == NULL): Query the NUMA location of specific pages without moving them. Build a heatmap of where your data actually lives. If your "NUMA-local" allocation ended up on the wrong node (because of memory pressure or rebalancing), this tells you. Then use migrate_pages() or move_pages() to fix it.
- set_mempolicy(MPOL_PREFERRED_MANY) (5.15+): Prefer allocation from a set of NUMA nodes instead of just one. Falls back to other nodes only when all preferred nodes are exhausted. Designed for memory tiers such as DRAM plus CXL-attached memory, where you want a preference order without hard binding.
- MADV_DONTFORK / MADV_WIPEONFORK: With DONTFORK the region is not mapped in the child after fork. With WIPEONFORK (4.14+) the child sees it zero-filled. Critical for security-sensitive memory (encryption keys, RNG state). Without them, a forked child inherits your key material via CoW pages, and a userspace RNG that inherits its parent's state can produce the same output in both processes.
- mprotect() with PROT_MTE (ARM64 MTE): Enable the Memory Tagging Extension on a memory region. Every 16-byte granule gets a 4-bit tag. Pointers carry a tag in their top bits. A mismatched tag raises a fault (synchronously or asynchronously, depending on the configured mode). Catches use-after-free and buffer overflows in hardware at low cost. Android supports it on MTE-capable devices. The PROT_MTE flag on mmap/mprotect controls which regions are tagged.
- process_mrelease() (5.15+): Release the memory of an already-dying process faster. When you kill -9 a huge-memory process, the kernel can take seconds to free hundreds of GB. This syscall takes a pidfd for a process that has already received SIGKILL and reaps its memory in the caller's context. Userspace OOM killers (Android's lmkd, for example) use this so that a dead 512GB JVM gives its memory back promptly instead of whenever the victim gets scheduled.
- KSM (Kernel Same-page Merging) via madvise(MADV_MERGEABLE): The kernel periodically scans marked regions and CoW-merges identical pages. A hypervisor running 100 VMs with the same OS image can save a large fraction of its memory. The tradeoff: the scanning thread (ksmd) costs CPU, and CoW faults when pages diverge can cause latency spikes. Expert tuning involves pages_to_scan and sleep_millisecs under /sys/kernel/mm/ksm/, plus per-region MADV_MERGEABLE targeting.
- cachestat() (6.5+): Query how many pages of a file range are in the page cache, dirty, under writeback, or recently evicted, in a single syscall. Before this, you had to mmap the file and call mincore(), which returns one byte per page and no dirty information. Database engines can use this to decide whether to use buffered I/O (data already cached) or direct I/O (data cold, skip the cache).
- madvise(MADV_HUGEPAGE / MADV_FREE / MADV_DONTNEED / MADV_SEQUENTIAL): Fine-grained hints to the kernel's memory management. MADV_FREE (lazy free) is cheaper than MADV_DONTNEED because it defers actual page reclaim. MADV_POPULATE_READ / MADV_POPULATE_WRITE (5.14+) pre-fault pages without touching them from userspace, which eliminates page fault storms on large allocations.
- mlock2(MLOCK_ONFAULT): Lock pages into RAM only when faulted in, not eagerly. (The mlockall() equivalent is MCL_ONFAULT.) Critical for latency-sensitive apps that allocate large arenas but use them sparsely.
- process_madvise() (5.10+): Apply madvise to another process's address space. Used by memory management daemons and container runtimes.
- userfaultfd(): Get notified in userspace when a page fault occurs. Enables post-copy live migration (QEMU), lazy restore of checkpointed processes (CRIU), and userspace-managed paging. Also used for custom allocators.
- memfd_create() + memfd_secret(): memfd_create gives you anonymous file-backed memory (great for zero-copy IPC by passing the fd with sendmsg/SCM_RIGHTS). memfd_secret() (5.14+) creates memory that is removed from the kernel's direct map, so other processes and most kernel code paths cannot read it. Defense in depth against Spectre/Meltdown-style side channels.
- mbind() / set_mempolicy() / move_pages(): NUMA memory placement. mbind binds memory ranges to specific NUMA nodes. move_pages migrates individual pages between nodes. Essential for database buffer pools and HPC.
Virtual Memory & TLB Tricks
- prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME): Name anonymous VMAs so they show up labeled in /proc/pid/maps. Android uses this extensively for debugging memory: you can tell your allocator's arenas apart from mmap'd files. Trivial to use, and almost nobody outside Android and game engines knows it exists.
- mmap(MAP_FIXED_NOREPLACE) (4.17+): Like MAP_FIXED, but fails with EEXIST instead of silently unmapping existing mappings. Eliminates an entire class of bugs where you accidentally clobber your own address space. Prefer it wherever you place mappings at fixed addresses.
- mmap(MAP_POPULATE | MAP_LOCKED): Pre-fault and lock all pages at mmap time. Avoids both minor faults and swapout. The nuclear option for latency, used in HFT shared memory segments.
- MADV_COLD / MADV_PAGEOUT (5.4+): MADV_COLD demotes pages to the inactive list (they'll be reclaimed first under pressure). MADV_PAGEOUT reclaims them immediately to swap/zswap. Android uses these to manage app memory tiers. They let you implement your own multi-tier memory policy from userspace.
- mremap(MREMAP_DONTUNMAP) (5.7+): Move pages to a new address but leave the old VMA in place, now empty. Accesses to the old range then take fresh faults (zero-fill, or a userfaultfd handler if one is registered) instead of SIGSEGV while you update pointers. Used in userspace garbage collectors.
- PR_SET_THP_DISABLE: Disable transparent huge pages for a specific process. When your allocation pattern fragments 2MB pages, THP allocation and compaction can cause multi-millisecond tail latency spikes. Redis and many databases recommend disabling THP for reasons like this; the prctl lets you do it per-process instead of system-wide.
- mincore(): Query which pages of a mapping are resident in memory. Lets you build page-cache-aware prefetching: only issue readahead for ranges not already cached. Database query engines can use this to avoid redundant I/O on hot data.
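As a rough sketch of the MAP_POPULATE entry, the snippet below maps a pre-faulted anonymous region and counts the minor faults taken while touching every page. The fallback value 0x8000 for MAP_POPULATE is an assumption that holds on x86-64 and arm64; both helper names are hypothetical. MAP_LOCKED is left out because it depends on RLIMIT_MEMLOCK.

```python
import mmap
import resource

# Exposed by the mmap module on recent Python versions. The fallback 0x8000 is
# the x86-64/arm64 value and is an assumption on other architectures.
MAP_POPULATE = getattr(mmap, "MAP_POPULATE", 0x8000)


def map_prefaulted(size: int) -> mmap.mmap:
    """Anonymous private mapping whose pages are faulted in at mmap() time."""
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS | MAP_POPULATE
    return mmap.mmap(-1, size, flags=flags, prot=mmap.PROT_READ | mmap.PROT_WRITE)


def minor_faults_while_touching(region: mmap.mmap) -> int:
    """Write one byte per page and return how many minor faults that cost."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    for offset in range(0, len(region), mmap.PAGESIZE):
        region[offset] = 1
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt - before


if __name__ == "__main__":
    region = map_prefaulted(256 * mmap.PAGESIZE)
    print("minor faults after MAP_POPULATE:", minor_faults_while_touching(region))
```

Running the same loop on a mapping created without MAP_POPULATE should show roughly one fault per page, which makes the difference easy to see.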
Storage & Direct I/O
- O_DIRECT + io_uring with registered buffers: The holy grail of storage I/O. O_DIRECT bypasses the page cache (no double-buffering, predictable memory usage). io_uring with IORING_REGISTER_BUFFERS + IORING_OP_READ_FIXED pre-registers and pins the buffer memory so the kernel skips get_user_pages() on every I/O. Combined with IORING_SETUP_SQPOLL, you get DMA from NVMe directly into your buffer pool with no syscalls and no per-I/O page pinning in the hot path.
- pwritev2() with RWF_DSYNC: Per-I/O O_DSYNC without opening the file with O_DSYNC. Lets you choose per write whether you need durability. Database WAL writes use RWF_DSYNC, while data page writes go buffered and get flushed later. Avoids the all-or-nothing O_DSYNC flag on the fd.
- pwritev2() with RWF_NOWAIT: Return EAGAIN if the write would block (for example because it needs journal space, a memory allocation, or block allocation). Combined with io_uring, this lets you keep the fast path non-blocking and fall back to the kernel worker thread pool only when necessary. Avoids head-of-line blocking in your submission queue.
- pwritev2() with RWF_APPEND: Per-write append without O_APPEND on the fd. The data lands at the end of the file regardless of the offset argument. Multiple threads can append to a log file through an fd that is also used for positioned writes, which O_APPEND would not allow (it forces every write on the fd to append).
- preadv2() with RWF_HIPRI: Request polling-based completion instead of interrupt-driven completion. The kernel busy-polls the device completion queue for this specific I/O. Saves the interrupt latency, a few microseconds. Only useful with O_DIRECT on low-latency devices whose driver supports polling, such as NVMe with poll queues; elsewhere it does nothing.
- fallocate(FALLOC_FL_ZERO_RANGE) vs FALLOC_FL_PUNCH_HOLE: Both make a range read back as zeroes, but differently. ZERO_RANGE guarantees zeroed reads and typically keeps the blocks allocated (as unwritten extents, filesystem-dependent). PUNCH_HOLE (always combined with FALLOC_FL_KEEP_SIZE) deallocates the blocks and the file becomes sparse. For a WAL that recycles segments, ZERO_RANGE is correct (zeroed data, pre-allocated extents preserved). PUNCH_HOLE is for freeing space (log compaction, sparse snapshots).
- ioctl(BLKDISCARD) / ioctl(BLKSECDISCARD): Tell the device to TRIM/discard a range of LBAs. BLKDISCARD is a hint (the SSD may or may not return zeroes afterwards). BLKSECDISCARD requests a secure discard, where the device is expected to make the old data unrecoverable. A database that manages a raw block device can issue BLKDISCARD after dropping a table to give the SSD free blocks back, which reduces write amplification and keeps performance consistent.
- ioctl(FICLONE) / ioctl(FICLONERANGE): Reflink copy: create a CoW clone of a file (or byte range) almost instantly. Both files share the same physical extents until one is modified. On btrfs or XFS (with reflink enabled), cloning a 100GB database file is a metadata-only operation. Used for instant snapshots and test database provisioning.
- io_uring IORING_OP_SPLICE + IORING_OP_TEE: Zero-copy data movement through io_uring. tee only works pipe-to-pipe, so the chain is: splice from the socket into a pipe, tee that pipe into a second pipe, then splice the second pipe into a file while the first continues to its destination. You've just logged network traffic to disk with no userspace copies and no syscalls in the steady state. Proxy servers and packet capture tools can use this.
- io_uring IORING_REGISTER_FILES + IOSQE_FIXED_FILE: Pre-register file descriptors so io_uring indexes into its own table instead of looking up and reference-counting the file (fget()/fput()) on every operation. In a heavily threaded server those atomic reference counts are measurable. (IORING_SETUP_REGISTERED_FD_ONLY is a separate setup flag that concerns the ring's own fd, not the files you do I/O on.)
- NVMe passthrough via io_uring (IORING_OP_URING_CMD on /dev/ng0n1): Send raw NVMe commands (admin or I/O) through io_uring. Bypass the entire block layer. You can issue vendor-specific commands (SSD internal telemetry, zone management, key-value commands on KV-SSDs), or do I/O with custom metadata (T10-PI, DIF). SPDK without leaving the kernel.
- splice() / tee() / vmsplice(): Zero-copy data movement between file descriptors via kernel pipe buffers. vmsplice maps userspace pages into a pipe. HAProxy uses splice() for TCP forwarding; nginx and Kafka rely on the closely related sendfile().
- copy_file_range(): Server-side copy between two fds, potentially without data touching userspace or even the CPU (on NFS, or on reflink-capable filesystems like btrfs/XFS).
- fallocate(FALLOC_FL_PUNCH_HOLE / FALLOC_FL_COLLAPSE_RANGE / FALLOC_FL_INSERT_RANGE): Surgically manipulate file extents. Punch holes for sparse files, collapse ranges to remove data without rewriting, insert ranges to add space mid-file. Database engines use this for WAL recycling and SST compaction.
- sync_file_range(): Initiate writeback on a byte range without waiting and without flushing metadata. Lets you pipeline writeback: start flushing page N while writing page N+1. PostgreSQL uses it to control writeback of dirty data. It is not a durability guarantee on its own; you still need fsync()/fdatasync() for that.
- open_by_handle_at() / name_to_handle_at(): Resolve files by filesystem-level handle (inode + generation), bypassing path resolution entirely. NFS servers and backup tools use these. open_by_handle_at() requires CAP_DAC_READ_SEARCH, and it is a privilege escalation vector if that capability leaks into a container.
- readahead(): Explicitly trigger kernel readahead into the page cache for a file range. Useful when your access pattern is known but not sequential enough for the kernel's heuristic.
- posix_fadvise(POSIX_FADV_DONTNEED): Evict pages from the page cache after you're done. Prevents a big sequential scan from polluting the cache for other workloads.
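The per-write durability idea behind RWF_DSYNC can be sketched with `os.pwritev`, which maps to pwritev2() when flags are passed. The function name `append_wal_record` and the fallback to `pwrite` plus `fdatasync` are assumptions made for illustration, so the sketch still runs on kernels or filesystems that reject the flag.

```python
import errno
import os
import tempfile


def append_wal_record(fd: int, record: bytes, offset: int) -> int:
    """Write one record at `offset` and return once it is durable."""
    try:
        # RWF_DSYNC gives this single write O_DSYNC semantics.
        return os.pwritev(fd, [record], offset, os.RWF_DSYNC)
    except (AttributeError, NotImplementedError):
        pass  # this Python build or platform has no RWF_DSYNC / pwritev
    except OSError as exc:
        if exc.errno not in (errno.EOPNOTSUPP, errno.EINVAL, errno.ENOSYS):
            raise
    # Fallback (an assumption of this sketch): plain write, then flush data.
    written = os.pwrite(fd, record, offset)
    os.fdatasync(fd)
    return written


def demo():
    with tempfile.TemporaryFile() as f:
        written = append_wal_record(f.fileno(), b"record-1\n", 0)
        return written, os.pread(f.fileno(), written, 0)


if __name__ == "__main__":
    print(demo())
```

A production writer would also loop on short writes; that is omitted here to keep the sketch small.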
File System Internals
- FS_IOC_FIEMAP (ioctl): Query the exact physical block layout of a file on disk. Returns extent maps showing which logical byte ranges map to which physical disk offsets. Database storage engines use this to detect fragmentation and make defragmentation decisions. Also reveals if a file is sparse, preallocated, or reflinked.
- FS_IOC_GETFLAGS / FS_IOC_SETFLAGS: Per-file flags like FS_IMMUTABLE_FL (even root can't modify the file until the flag is cleared), FS_APPEND_FL (append-only), and FS_NOCOW_FL (disable copy-on-write on btrfs, which is critical for database files there because random writes otherwise fragment the file through CoW; set it while the file is still empty). chattr is the CLI wrapper, but the ioctl is what you use programmatically.
- renameat2(RENAME_EXCHANGE): Atomically swap two paths. Both exist before and after. Used for atomic config updates: write the new config to a temp file, then RENAME_EXCHANGE it with the live config. Observers see either the old arrangement or the new one, never a missing file, and the previous version remains available at the temp path for rollback.
- renameat2(RENAME_WHITEOUT): Rename a file and leave a whiteout entry at the old path. Used by overlay filesystems (the basis of Docker's overlay storage driver): a whiteout means "this file was deleted in the upper layer" without physically removing the lower layer's copy.
- statx(): The modern stat() with an extensible struct. Returns stx_btime (birth/creation time, which stat() cannot report), stx_dio_mem_align / stx_dio_offset_align (the exact alignment requirements for O_DIRECT on this file, so no more guessing), and stx_mount_id. The alignment fields alone are worth the switch: different filesystems and block devices have different direct I/O alignment requirements, and this tells you exactly.
- openat2() with RESOLVE_BENEATH | RESOLVE_NO_SYMLINKS | RESOLVE_NO_MAGICLINKS (5.6+): Path resolution with security constraints. RESOLVE_BENEATH prevents escaping a directory (replacing O_NOFOLLOW plus manual checks). RESOLVE_NO_MAGICLINKS blocks /proc/self/fd/N-style traversal. Container runtimes use this to prevent symlink attacks on bind mounts.
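Python has no wrapper for renameat2(), so the RENAME_EXCHANGE entry is sketched here through ctypes. This assumes a C library that exports `renameat2` (glibc 2.28 or later does) and a filesystem that supports the flag; `exchange` and `demo` are hypothetical names, and `demo` returns None when either assumption fails.

```python
import ctypes
import os
import tempfile

AT_FDCWD = -100           # "relative to the current directory"
RENAME_EXCHANGE = 1 << 1  # value from <linux/fs.h>


def exchange(path_a: str, path_b: str) -> None:
    """Atomically swap two existing paths with renameat2(RENAME_EXCHANGE)."""
    libc = ctypes.CDLL(None, use_errno=True)
    renameat2 = getattr(libc, "renameat2", None)
    if renameat2 is None:
        raise NotImplementedError("this libc does not export renameat2()")
    renameat2.argtypes = [ctypes.c_int, ctypes.c_char_p,
                          ctypes.c_int, ctypes.c_char_p, ctypes.c_uint]
    renameat2.restype = ctypes.c_int
    if renameat2(AT_FDCWD, os.fsencode(path_a),
                 AT_FDCWD, os.fsencode(path_b), RENAME_EXCHANGE) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def demo():
    with tempfile.TemporaryDirectory() as d:
        live, staged = os.path.join(d, "config"), os.path.join(d, "config.new")
        with open(live, "w") as f:
            f.write("old")
        with open(staged, "w") as f:
            f.write("new")
        try:
            exchange(live, staged)
        except (OSError, NotImplementedError):
            return None  # not supported in this environment
        with open(live) as f_live, open(staged) as f_staged:
            return f_live.read(), f_staged.read()


if __name__ == "__main__":
    print(demo())
```

After a successful swap the old contents sit at the staging path, which is what makes a rollback a second call to `exchange`.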
Profiling & Observability
- perf_event_open() with PERF_SAMPLE_DATA_SRC: For each sampled memory access, tells you where the data came from: L1/L2/L3 hit, local DRAM, remote DRAM (cross-NUMA), or I/O (MMIO). This is how you prove that your "NUMA-aware" data structure is actually hitting remote DRAM 40% of the time. perf mem record wraps this.
- perf_event_open() with PERF_SAMPLE_WEIGHT: Each sample carries a weight (latency in cycles). Sort by weight to find the most expensive individual memory accesses, not just the most frequent. A last-level cache miss costs hundreds of cycles; finding the 10 instructions responsible for 90% of your stall cycles is the difference between a 2% and a 30% optimization.
- perf_event_open() with PERF_SAMPLE_PHYS_ADDR: Sample the physical address of memory accesses. Mapping physical addresses to DIMM topology can help pin slow accesses on a particular DIMM, and can expose DRAM bank conflicts and row buffer miss patterns.
- Intel PEBS (Processor Event-Based Sampling) via perf_event_open() with precise_ip=3: The CPU itself writes samples to a hardware buffer, then interrupts the kernel to drain it. Gives precise IP attribution (precise_ip=3 requests zero skid). Ordinary interrupt-based sampling skids by a variable number of instructions: your profile says function A is hot, but an instruction a little earlier in function B actually caused the event. PEBS removes most of this.
- AMD IBS (Instruction-Based Sampling): AMD's answer to PEBS. Randomly selects instructions and records everything about their execution: fetch latency, cache hit level, TLB level, branch prediction correctness, completion latency. IBS Fetch and IBS Op together give you the full microarchitectural story per sampled instruction.
- bpf() with BPF_PROG_TYPE_RAW_TRACEPOINT: Attach to raw tracepoints and receive the tracepoint's raw arguments, skipping the step where the kernel marshals them into the stable per-tracepoint record. With BTF you can read kernel struct fields directly. Noticeably cheaper than regular tracepoint programs for high-frequency events (scheduler, memory allocator).
- perf_event_open() with PERF_TYPE_BREAKPOINT: Hardware watchpoints via the debug registers. Monitor reads/writes to a specific memory address. Up to 4 simultaneous watchpoints on x86. Faster and more precise than page-fault-based watchpoints (mprotect tricks). Used to answer "who is writing to this corrupted field?"
- perf_event_open() with Intel PT: Full control-flow tracing in hardware. Intel PT is a dynamic PMU, so the type value comes from /sys/bus/event_source/devices/intel_pt/type rather than a fixed PERF_TYPE_* constant. The CPU logs branch outcomes to a buffer at low overhead (often quoted at around 5%), and you can reconstruct the exact execution path post-mortem. Used for hardware-assisted fuzzing (kAFL, honggfuzz), execution-history debugging (GDB's record btrace), and production tracing of rare bugs.
- USDT (User Statically-Defined Tracepoints) via bpf(): Embed a NOP at key points in your binary. eBPF attaches to them at runtime, with near-zero cost when disabled. PostgreSQL, MySQL, the JVM, and Python have USDT probes. bpftrace -p $PID -e 'usdt:postgresql:query__start { printf("%s\n", str(arg0)); }' gives you production query tracing with no restart.
- /proc/pid/smaps_rollup: Single-read summary of a process's memory: RSS, PSS (proportional share of shared pages), swap, anonymous, file-backed, huge pages. smaps gives per-VMA detail but emits one text block per VMA and can take more than 100ms for large processes. smaps_rollup still walks the VMAs, but accumulates the totals in the kernel and returns a single record. Monitoring daemons should prefer smaps_rollup: reading full smaps for a JVM with 100K VMAs can stall the target process.
- /proc/pressure/{cpu,memory,io} (PSI): System-wide pressure stall information. cat /proc/pressure/memory gives you a line such as some avg10=0.50 avg60=0.25 avg300=0.10 total=12345, where the averages are the percentage of time some tasks were stalled on memory and total is the cumulative stall time in microseconds. You can poll() these files after writing a trigger threshold to them. Android and systemd-oomd use PSI instead of free-memory thresholds for OOM decisions because PSI measures actual impact, not just headroom.
- pidfd_open() / pidfd_send_signal() / pidfd_getfd(): Race-free process management via file descriptors (no PID recycling issues). pidfd_getfd duplicates a file descriptor out of another process; it is used by debuggers and container runtimes.
- kcmp(): Compare kernel objects between two processes (do these two fds point to the same underlying file/socket/pipe?). Used by checkpoint/restore (CRIU).
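A monitoring agent usually starts by parsing the PSI text format described above. The following is a minimal sketch; `parse_psi` and `read_psi` are hypothetical helper names, and the sample line is the one quoted in the PSI entry.

```python
def parse_psi(text: str) -> dict:
    """Parse the contents of a /proc/pressure/* file into nested dicts."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]  # kind is "some" or "full"
        stats = {}
        for pair in pairs:
            key, _, value = pair.partition("=")
            # total is cumulative stall time in microseconds; avgN are percentages.
            stats[key] = int(value) if key == "total" else float(value)
        result[kind] = stats
    return result


def read_psi(resource: str = "memory") -> dict:
    """Read and parse /proc/pressure/<resource>; requires a PSI-enabled kernel."""
    with open(f"/proc/pressure/{resource}") as f:
        return parse_psi(f.read())


if __name__ == "__main__":
    sample = "some avg10=0.50 avg60=0.25 avg300=0.10 total=12345\n"
    print(parse_psi(sample))
```

Keeping the parser separate from the file read makes it easy to unit-test against captured samples from production hosts.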
Security
- landlock_restrict_self(): Unprivileged filesystem sandboxing. A process voluntarily restricts itself to a set of file hierarchies and access rights. Unlike seccomp (which filters syscalls), Landlock filters filesystem operations. A document renderer can say "I can only read from /tmp/render_input/ and write to /tmp/render_output/", so even if an exploit gains arbitrary code execution, it can't touch /etc/shadow.
- seccomp(SECCOMP_SET_MODE_FILTER) with SECCOMP_FILTER_FLAG_TSYNC: Apply the seccomp filter to all threads of the process simultaneously. Without TSYNC the filter applies only to the calling thread (and threads it creates afterwards), so threads that already exist keep running unfiltered. Multi-threaded runtimes rely on TSYNC because a sandbox with a gap gives a false sense of security.
- prctl(PR_SET_NO_NEW_PRIVS): The process (and all descendants) can never gain privileges via execve of setuid/setgid binaries or file capabilities. Required before seccomp(SECCOMP_SET_MODE_FILTER) for processes without CAP_SYS_ADMIN. Without that requirement, an unprivileged process could install a filter that confuses a setuid binary and then exec it. This is why it's mandatory.
- prctl(PR_SET_SECCOMP) + SECCOMP_RET_LOG: Audit mode: the syscall is allowed, but logged to the kernel audit log. Run your production workload with a restrictive filter in log mode for a week, analyze the logs, then tighten. Iterative seccomp profile development instead of guessing.
- memfd_secret(): Memory that's removed from the kernel's direct map. A bug that only yields arbitrary kernel reads through the direct map cannot see this memory. Designed for cryptographic key storage. The process sees it as normal mmap'd memory, but it is not reachable through /proc/kcore and similar interfaces. Hardware cost: TLB pressure from splitting the direct map's huge pages.
- mseal() (6.10+): Immutable memory mappings. Once sealed, mprotect, munmap, mremap, and mmap(MAP_FIXED) all fail on the sealed range. Prevents an attacker who can trigger syscalls from calling mprotect(PROT_EXEC) on your data or remapping your code. Chrome was the motivating user. The runtime cost is near-zero because it's just a flag on the VMA.
- prctl(PR_SET_MDWE) (6.3+): Memory-Deny-Write-Execute. The process can never have a mapping that's both writable and executable. mmap(PROT_WRITE|PROT_EXEC) fails, and mprotect() cannot add PROT_EXEC to a mapping that was not already executable. That means the usual "write code, then mprotect to RX" JIT pattern is blocked as well; executable memory has to come from mapping a file. Stops code injection that relies on flipping page permissions. systemd exposes the same policy per unit as MemoryDenyWriteExecute=.
- prctl(PR_SET_TAGGED_ADDR_CTRL) with PR_TAGGED_ADDR_ENABLE (ARM64): Enable the tagged address ABI. The hardware already ignores the top byte of userspace pointers (Top Byte Ignore); this prctl makes the kernel accept tagged pointers in syscall arguments. This is the foundation for ARM's Memory Tagging (MTE) and HWASan. Combined with PROT_MTE, you get hardware-enforced memory safety.
- keyctl() / add_key() / request_key(): The kernel keyring. Store crypto keys in kernel memory (not visible in /proc/pid/mem). Keys can have access controls, expiration, and type-specific operations (asymmetric crypto, encrypted, trusted TPM-backed). dm-crypt, eCryptfs, and fscrypt all use the keyring. KEYCTL_RESTRICT_KEYRING prevents unauthorized key injection.
- ioctl(FS_IOC_SET_ENCRYPTION_POLICY) (fscrypt): Per-directory encryption on ext4/f2fs/UBIFS. Each directory tree can use a different key. When the user locks their phone, the key is evicted from the keyring and the files become undecryptable ciphertext. Unlike full-disk encryption, this gives per-user/per-app key separation. Android and ChromeOS use this for at-rest encryption.
- fanotify with FAN_OPEN_EXEC_PERM: Intercept and approve or deny every file opened for execution. An anti-malware daemon gets an event before execve completes, can inspect the binary, and returns FAN_ALLOW or FAN_DENY. This is Linux's answer to Windows' PsSetCreateProcessNotifyRoutine. On-access scanners such as ClamAV's are built on fanotify permission events.
- ptrace(PTRACE_SECCOMP_GET_FILTER): Dump the seccomp BPF filter of a traced process. Used by security auditors to verify that a container's seccomp profile actually matches what was intended. Without this, you're trusting that the runtime installed the right filter.
- seccomp(SECCOMP_RET_USER_NOTIF) + ioctl(SECCOMP_IOCTL_NOTIF_ADDFD): A supervisor process intercepts a sandboxed process's syscalls and can inject file descriptors into the target. The sandboxed process calls open("/dev/gpu"), the supervisor validates it, opens the real device, and injects the fd. The sandboxed process never touches the real device node. Unprivileged container managers use this pattern to broker access to devices and mounts.
- seccomp(SECCOMP_SET_MODE_FILTER): Install a BPF filter on all syscalls for the calling thread. Chrome, Docker, and every serious sandbox uses this. The SECCOMP_RET_USER_NOTIF action lets a supervisor process make policy decisions in userspace.
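The no_new_privs step that must precede an unprivileged seccomp filter can be sketched through ctypes. Because the flag is irreversible, this sketch sets it in a forked child only. The constants 38 and 39 are the values from `<linux/prctl.h>`; the helper names are assumptions for illustration.

```python
import ctypes
import os

PR_SET_NO_NEW_PRIVS = 38  # from <linux/prctl.h>
PR_GET_NO_NEW_PRIVS = 39

_libc = ctypes.CDLL(None, use_errno=True)
_libc.prctl.argtypes = [ctypes.c_int] + [ctypes.c_ulong] * 4
_libc.prctl.restype = ctypes.c_int


def set_no_new_privs() -> None:
    """Irreversibly forbid privilege gain via execve for this thread and its children."""
    if _libc.prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def get_no_new_privs() -> int:
    return _libc.prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0)


def check_in_child() -> bool:
    """Set the flag in a forked child so the parent stays unaffected."""
    pid = os.fork()
    if pid == 0:
        try:
            set_no_new_privs()
            ok = get_no_new_privs() == 1
        except Exception:
            ok = False
        os._exit(0 if ok else 1)
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print("no_new_privs set in child:", check_in_child())
```

In a real sandbox the child would go on to install its seccomp filter and exec the confined program.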
Reliability & Fault Tolerance
- prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY): Per-process policy for uncorrectable memory errors. With the default late-kill policy, a process only learns about a poisoned page when it touches it and receives SIGBUS (BUS_MCEERR_AR). With early kill, you get SIGBUS (BUS_MCEERR_AO) as soon as the error is detected, even before you access the page. Database engines use this with a SIGBUS handler that marks the affected page as corrupt and initiates recovery from a replica.
- madvise(MADV_HWPOISON): Simulate a hardware memory error on a page (privileged, and the kernel must be built with memory failure handling). Used to test your MCE recovery path without actual bad RAM. Inject poison, verify your SIGBUS handler kicks in, verify your database marks the page bad and recovers. Essential for testing fault tolerance of in-memory databases.
- MADV_SOFT_OFFLINE: Ask the kernel to migrate data off a page that has correctable ECC errors and take the physical page offline. Proactive repair: monitoring tools watch mcelog/rasdaemon for correctable errors, identify trending pages, and MADV_SOFT_OFFLINE them before they become uncorrectable. Hyperscalers do this at fleet scale.
- PR_GET_TIMERSLACK / PR_SET_TIMERSLACK: Control timer coalescing granularity, in nanoseconds. The default is 50μs: the kernel may delay your timer by up to 50μs to batch it with others (saving power). For latency-critical paths, set it to 1 (one nanosecond of slack). For background tasks, set it to 10ms (save power, reduce wakeups). Android sets different slack for foreground vs background apps.
- ioprio_set(IOPRIO_CLASS_IDLE): Set I/O scheduling priority. IOPRIO_CLASS_IDLE means "only do I/O when nobody else needs the disk." Backup processes, log compaction, and background defragmentation should use this so they interfere as little as possible with latency-sensitive foreground I/O. Pair it with IOPRIO_CLASS_RT (privileged) for WAL writes that must not be delayed. The classes only take effect with an I/O scheduler that honors them, such as BFQ.
- dup3() with O_CLOEXEC: Atomically duplicate an fd with close-on-exec. Without O_CLOEXEC, there's a race between dup2() and fcntl(F_SETFD, FD_CLOEXEC) where a concurrent fork+exec leaks the fd to a child. This is a real security bug class: leaked fds to child processes have caused privilege escalation in production systems.
- pipe2() / socket() / accept4() / epoll_create1() with O_CLOEXEC / SOCK_CLOEXEC / EPOLL_CLOEXEC: Same pattern. Nearly every fd-creating syscall now has a CLOEXEC variant. Senior engineers use only these variants, never the legacy versions. The window between socket() and fcntl(F_SETFD, FD_CLOEXEC) is a textbook fd leak.
- renameat2(RENAME_EXCHANGE) for crash-safe updates: The classic pattern is: write new data to file.tmp, fsync(file.tmp), rename(file.tmp, file), fsync(directory). That rename is already atomic, but it destroys the old version. With RENAME_EXCHANGE, you swap file.tmp and file atomically, so after the swap the previous version is still on disk at file.tmp. That gives you a rollback copy for free; you still need the same fsync calls for durability.
- sync_file_range(SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_BEFORE): The pipelined-writeback trick. Phase 1: SYNC_FILE_RANGE_WRITE (initiate writeback, don't wait). Phase 2: continue writing more data. Phase 3: SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE (wait for phase 1 to finish, initiate writeback of phase 2). You're always writing the next batch while the previous batch is flushing, so the final fsync()/fdatasync() has little left to do. sync_file_range() itself does not flush metadata or the device write cache, so it cannot replace that final call.
- fcntl(F_OFD_SETLK): Open file description locks (Linux 3.15+). Unlike fcntl(F_SETLK) locks (which are per-process and are released when any fd to the file is closed), OFD locks belong to the open file description. Multi-threaded programs can have per-thread file locks without interference by giving each thread its own open(). The old POSIX locks were fundamentally broken for multithreaded programs: closing any fd for the file released all locks the process held on it.
- pidfd_open() + pidfd_send_signal() + poll(): The complete PID-race-free lifecycle. Open a pidfd, poll it (it becomes readable on exit), send signals via the pidfd (never hits the wrong process). No TOCTOU between "check if alive" and "send signal." The old kill(pid, 0) check followed by kill(pid, SIGTERM) has a race where the PID could be recycled between the two calls. In a containerized environment cycling through PIDs rapidly, this is a real bug.
- close_range(3, ~0U, CLOSE_RANGE_CLOEXEC): Mark fds 3 through the maximum as close-on-exec in one syscall. Alternative to close_range(3, ~0U, 0) (which closes them immediately). Used in security-sensitive fork+exec paths to ensure no fd leaks without the cost of iterating /proc/self/fd.
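The atomic close-on-exec variants are reachable from Python's standard library. The sketch below uses three hypothetical helpers. Note that Python itself has created non-inheritable fds by default since 3.4 (PEP 446), so the explicit flags mainly show which syscall variants are involved.

```python
import os
import socket


def cloexec_pipe():
    """pipe2(O_CLOEXEC): both ends are close-on-exec from the moment they exist."""
    return os.pipe2(os.O_CLOEXEC)


def cloexec_socket() -> socket.socket:
    """socket(..., SOCK_CLOEXEC): no window between creation and setting the flag."""
    return socket.socket(socket.AF_UNIX, socket.SOCK_STREAM | socket.SOCK_CLOEXEC)


def dup_cloexec(fd: int, target: int) -> int:
    """Duplicate fd onto target with close-on-exec set.

    CPython uses dup3(fd, target, O_CLOEXEC) for this where the platform has it.
    """
    return os.dup2(fd, target, inheritable=False)


if __name__ == "__main__":
    r, w = cloexec_pipe()
    print("pipe ends inheritable:", os.get_inheritable(r), os.get_inheritable(w))
    os.close(r)
    os.close(w)
```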
Direct Hardware Access
- iopl(3) / ioperm(): Grant userspace access to x86 I/O ports. Level 3 gives access to all ports. DPDK's legacy PCI driver mode uses this. Port-mapped device registers can then be read and written with in/out instructions without entering the kernel. Requires CAP_SYS_RAWIO. Replaced by VFIO in modern stacks but still used for legacy industrial control hardware (PLCs, custom FPGA boards).
- mmap() on /dev/mem: Map physical memory into userspace. Used for MMIO access to hardware registers. Embedded developers use this to poke FPGA registers. On servers, it's used for accessing ACPI tables, BIOS regions, and PCIe BARs of devices without a kernel driver. With CONFIG_STRICT_DEVMEM=y only non-RAM ranges such as MMIO can be mapped; mapping ordinary RAM requires CONFIG_STRICT_DEVMEM=n. The related /dev/kmem, which exposed kernel virtual memory, was removed in 5.13.
- VFIO_SET_IOMMU + VFIO_IOMMU_MAP_DMA + VFIO_DEVICE_SET_IRQS: Full userspace device driver framework. Map a PCI device's BARs into userspace, set up DMA mappings via the IOMMU, and configure MSI-X interrupts delivered via eventfd. DPDK (NICs), SPDK (NVMe), and GPU passthrough (QEMU/KVM) all use VFIO. The IOMMU ensures DMA isolation: the device can only DMA to pages you've explicitly mapped.
- /dev/vhost-net + ioctl(VHOST_SET_MEM_TABLE): Kernel-side virtio data plane. Set up virtqueues and memory mappings, then the kernel handles packet forwarding between a VM's virtio NIC and the host network stack without any userspace involvement. QEMU sets this up and then the data path is pure kernel. Roughly 2-5x throughput improvement over userspace virtio, depending on workload.
- ioctl(PERF_EVENT_IOC_SET_BPF) on a PMU event: Attach an eBPF program to a hardware performance counter overflow. Every N cache misses (or branch mispredictions, or TLB misses), your eBPF program runs with access to the instruction pointer, registers, and stack. Build custom profilers that collect application-specific context (request ID, tenant ID) alongside hardware events.
- ioctl(KVM_CREATE_VM) + KVM_SET_USER_MEMORY_REGION + KVM_RUN: Build a hypervisor in userspace. Create a VM, map its physical memory (backed by your process's virtual memory), load code, and run it. KVM_RUN returns on a VM exit that needs userspace handling (port I/O, MMIO, HLT, etc.), your code handles the exit, and you re-enter. Firecracker (AWS Lambda) is <50K lines of Rust doing exactly this. You can run a minimal VM in ~1000 lines of C.
- /dev/uio* (Userspace I/O): Map a device's MMIO registers and get interrupt notifications via read(). Simpler than VFIO but no IOMMU protection (the device can DMA anywhere). Used for simple FPGA boards where you trust the hardware. A write to the mmap'd region is a write to the device register. A read on the fd blocks until the next interrupt.
- pkey_alloc() / pkey_mprotect() / pkey_free(): Intel Memory Protection Keys. Partition your address space into up to 16 "domains" with per-thread read/write permissions. Change permissions with a single unprivileged register write (WRPKRU, tens of cycles) instead of mprotect() (a syscall plus TLB flush). Use cases: per-thread isolation (each request handler can only access its own data), guard regions without TLB cost, and sandboxing untrusted code within the same address space.
- io_uring with IORING_REGISTER_NAPI (6.x+): Register NAPI (NIC polling) with io_uring. When waiting for network completions, io_uring busy-polls the NIC directly, bypassing softirq. Combines the latency benefits of SO_BUSY_POLL with io_uring's batching model. AWS ENA and Intel ice drivers support this.
- /dev/udmabuf (UDMABUF_CREATE ioctl): Create a DMA-buf backed by memfd memory owned by userspace. Allows zero-copy sharing between userspace, the GPU, and other DMA-capable devices. Used in graphics compositing (Wayland) and ML inference pipelines where you want to pass tensors between the CPU and an accelerator without copies.
- perf_event_open() with PERF_TYPE_RAW: Access model-specific performance counters (Intel PEBS, AMD IBS). Things like: the exact instruction that caused an L3 miss, load latency histograms, memory access source (L1/L2/L3/local DRAM/remote DRAM). This is how perf mem and Intel VTune work under the hood.
- perf_event_open() with PERF_SAMPLE_BRANCH_STACK (LBR): Capture the Last Branch Record, a hardware trace of the last ~32 branches taken. Gives you a mini call-stack trace for free (no frame pointers needed). Used for AutoFDO (feedback-directed optimization): profile in production, feed branch traces to the compiler.
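As a rough illustration of the pkey entry above, the following C sketch tags one page with a protection key, revokes write access with a register write, and restores it. It assumes the glibc 2.27+ wrappers (pkey_alloc, pkey_mprotect, pkey_set, pkey_free) and a CPU and kernel with protection-key support; pkey_demo is a hypothetical name, and on machines without support it simply reports that.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 if the demo ran, 1 if protection keys are unavailable
 * on this machine, -1 on an unexpected error. */
int pkey_demo(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    volatile char *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    int pkey = pkey_alloc(0, 0); /* fails if hardware or kernel lacks pkeys */
    if (pkey < 0) {
        munmap((void *)buf, page);
        return 1;
    }

    int rc = -1;
    /* Tag the page with the key; page-table permissions stay read/write. */
    if (pkey_mprotect((void *)buf, page, PROT_READ | PROT_WRITE, pkey) == 0) {
        buf[0] = 'a';
        /* Revoke writes for this thread only: no syscall, no TLB flush. */
        if (pkey_set(pkey, PKEY_DISABLE_WRITE) == 0 &&
            buf[0] == 'a' &&            /* reads are still permitted */
            pkey_set(pkey, 0) == 0) {   /* restore full access */
            buf[0] = 'b';               /* writes work again */
            rc = 0;
        }
    }

    munmap((void *)buf, page); /* unmap before releasing the key */
    pkey_free(pkey);
    return rc;
}
```

A write attempted while PKEY_DISABLE_WRITE is set would raise SIGSEGV with si_code SEGV_PKUERR; the sketch deliberately avoids triggering it.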
Networking
- sendmsg() with MSG_ZEROCOPY: True zero-copy send (enable it first with setsockopt(SO_ZEROCOPY)). The kernel pins your userspace pages and DMAs directly from them, so the buffer must not be reused until the completion notification arrives via recvmsg(MSG_ERRQUEUE). Pays off for large sends (roughly 10 KB and up) on 10G+ links.
- SO_BUSY_POLL / SO_PREFER_BUSY_POLL: Spin-poll the NIC driver directly from epoll_wait / recvmsg, bypassing softirq. Trades CPU for lower latency (reductions around 50% are reported). Used in HFT and latency-sensitive services.
- BPF_PROG_TYPE_XDP (via bpf()): Attach eBPF programs that run at the NIC driver level before the kernel allocates an sk_buff. Can drop, redirect, or modify packets at line rate. Cloudflare's DDoS mitigation, Facebook's L4 load balancer (Katran).
- SO_REUSEPORT + BPF_PROG_TYPE_SK_REUSEPORT: Multiple sockets on the same port with eBPF-controlled routing. Lets you implement consistent hashing for load balancing across worker threads without a dispatch bottleneck.
- TCP_FASTOPEN: Send data in the SYN packet. Saves one RTT on connection establishment. Requires a TFO cookie from a prior connection.
- SCM_RIGHTS via sendmsg(): Pass file descriptors between processes over Unix domain sockets. The mechanism behind graceful restarts in proxies such as Envoy and HAProxy (hand off listening sockets and live connections to a new process).
- AF_XDP sockets: Kernel-bypass networking. Userspace reads/writes packets directly from NIC ring buffers via shared UMEM. Near-DPDK performance without leaving the kernel's control plane.
- TCP_ULP (Upper Layer Protocol): Plug a protocol handler into the TCP stack. setsockopt(TCP_ULP, "tls") offloads TLS record processing to the kernel (kTLS). The kernel encrypts/decrypts in place, and you can sendfile() encrypted data: zero-copy TLS. Nginx and HAProxy use this for ~30% TLS throughput improvement.
- SO_INCOMING_CPU: Query which CPU received data for this socket. Combined with SO_REUSEPORT + eBPF, lets you build a NUMA-aware network stack where each core processes packets from its local NIC queue through to the application without cross-core traffic.
- TCP_REPAIR: Put a TCP socket into "repair mode" where you can get/set its internal state (sequence numbers, window sizes, queue contents, negotiated options). Used by CRIU to checkpoint and restore live TCP connections across process migration, which also enables live migration of containers without dropping connections.
- SO_TXTIME / SCM_TXTIME (ETF scheduler): Schedule packet transmission at a precise future timestamp. The kernel holds the packet and sends it at the specified time. Used in Time-Sensitive Networking (TSN) for industrial control, audio/video production, and automotive ethernet. Precise launch times require hardware support (a NIC with time-based scheduling).
- SO_TIMESTAMPING (hardware timestamps): Get nanosecond-precision timestamps from the NIC hardware. SOF_TIMESTAMPING_TX_HARDWARE tells you exactly when a packet hit the wire, not when the kernel queued it. PTP (Precision Time Protocol) and HFT use this. Combined with SO_SELECT_ERR_QUEUE, you can get TX timestamps asynchronously via epoll.
- PACKET_FANOUT with PACKET_FANOUT_EBPF: Distribute raw packets across a group of AF_PACKET sockets using an eBPF program for routing decisions. Used by network monitoring tools (Suricata, Zeek) to load-balance packet processing across cores with custom flow affinity.
- TCP_NOTSENT_LOWAT: Set the threshold of unsent data at which the socket becomes writable. Default behavior: epoll reports writable when the send buffer has space. With this option, it reports writable only when unsent data drops below your threshold. Reduces bufferbloat and memory usage for write-heavy workloads. Apple invented this for macOS; Linux adopted it in 3.12.
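The SCM_RIGHTS handoff described above can be sketched as a pair of C helpers. This is a minimal illustration that assumes an already connected AF_UNIX socket; send_fd and recv_fd are hypothetical names, and real restart code would also carry metadata about what each descriptor is.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a Unix domain socket. 0 on success. */
int send_fd(int sock, int fd)
{
    char byte = 'F'; /* at least one byte of real data must accompany the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr; /* forces correct alignment of buf */
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    memset(&ctl, 0, sizeof ctl);

    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctl.buf, .msg_controllen = sizeof ctl.buf,
    };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor. Returns the new fd, or -1 on failure. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctl.buf, .msg_controllen = sizeof ctl.buf,
    };

    /* MSG_CMSG_CLOEXEC marks the received fd close-on-exec atomically. */
    if (recvmsg(sock, &msg, MSG_CMSG_CLOEXEC) != 1)
        return -1;

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET ||
        c->cmsg_type != SCM_RIGHTS || c->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file description, so a listening socket passed this way keeps its accept queue across the handoff.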
cgroups & Resource Control (v2)
- cpu.pressure / memory.pressure / io.pressure (PSI, Pressure Stall Information): Read some and full stall percentages for CPU, memory, and I/O per cgroup. some = at least one task stalled. full = all non-idle tasks stalled. You can poll() on these files with trigger thresholds ("notify me when memory full stall exceeds 100ms in any 1s window"). Android and systemd-oomd use this for proactive OOM killing before the system falls over.
- memory.reclaim (5.19+): Write a byte count to proactively reclaim memory from a cgroup. Unlike MADV_PAGEOUT (which applies to an address range in one process), this works on the entire cgroup. Kubernetes node agents use this for graceful memory pressure relief instead of waiting for the kernel's OOM killer.
- cpu.max.burst: Allow a cgroup to accumulate unused CPU quota and burst above its limit temporarily. Solves the "microservice uses 10ms every 100ms but gets throttled because the 10ms happens in a 2ms burst" problem that plagues CFS bandwidth throttling.
- io.latency: Set latency targets per cgroup for block I/O. The kernel dynamically throttles other cgroups to meet your target. Unlike io.max (hard limits), this is workload-adaptive. Used for SSD-backed databases where you want latency SLOs, not throughput caps.
eBPF
- bpf() (eBPF): The meta-syscall. Attach sandboxed programs to kprobes, tracepoints, cgroup hooks, socket filters, LSM hooks, scheduler hooks. Effectively lets you extend the kernel at runtime.
- BPF_MAP_TYPE_RINGBUF (5.8+): Multi-producer, single-consumer, variable-length ring buffer shared between kernel eBPF and userspace. Replaces BPF_MAP_TYPE_PERF_EVENT_ARRAY with better performance: one buffer shared across CPUs instead of one per CPU (less wasted memory, events stay ordered) and a reserve/commit API that avoids an extra copy. Events are still dropped if the consumer falls behind and the buffer fills, so size it for your bursts. The modern way to stream events from kernel to userspace.
- BPF_PROG_TYPE_STRUCT_OPS: Replace kernel struct_ops callbacks with eBPF programs. You can replace the TCP congestion control algorithm at runtime (bpf_struct_ops_tcp_congestion_ops). Netflix wrote a custom congestion controller this way. You can also replace the scheduler's task selection logic (sched_ext).
- sched_ext (via eBPF struct_ops) (6.12+): Write your scheduler in eBPF. The kernel calls your eBPF program to make scheduling decisions (which task runs on which core). Meta uses scx_rusty for cache-aware scheduling in production. You can implement gang scheduling, latency-optimized scheduling, or ML-driven scheduling without recompiling the kernel.
- BPF_MAP_TYPE_ARENA (6.9+): Shared memory arena between eBPF and userspace at the same virtual address. Pointers work across both. Enables complex data structures (linked lists, trees) shared between kernel and userspace without serialization. A game-changer for high-performance observability.
- fentry / fexit eBPF program types: Attach to kernel function entry/exit with zero overhead when not attached (uses trampolines, not breakpoints). Replaces kprobes for most use cases with substantially less overhead (5-10x is often quoted). You get typed access to function arguments via BTF.
Namespace & Container Internals
- setns(fd, CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET | ...): Enter an existing namespace. Joining several namespaces in one call requires fd to be a pidfd (5.8+); a /proc/<pid>/ns/* fd joins a single namespace. A monitoring agent can setns into a container's network namespace, run diagnostics, and setns back. Combined with pidfd_open + pidfd_getfd, you can inspect a container's state without docker exec.
- pivot_root(): Atomically swap the root filesystem. Unlike chroot (which is trivially escapable with fchdir on an open fd), pivot_root puts the old root inside the new root, and you then umount it. This is how every container runtime sets up the root filesystem. chroot is not a security boundary; pivot_root + mount namespace is.
- mount(..., MS_SLAVE | MS_SHARED | MS_PRIVATE): Mount propagation. MS_SHARED means mounts propagate bidirectionally between mount namespaces. MS_SLAVE means propagation is one-way (host→container but not container→host). MS_PRIVATE means no propagation. Getting this wrong in a container runtime means either mounts leak between containers (security) or USB devices don't appear inside containers (usability).
- clone3() with CLONE_INTO_CGROUP (5.7+): Spawn a process directly into a cgroup atomically. Without this, you clone() and then write the child's PID to cgroup.procs, and there's a window where the child runs in the parent's cgroup, potentially consuming resources from the wrong budget. Kubernetes and systemd use CLONE_INTO_CGROUP to eliminate this race.
- unshare(CLONE_NEWUSER) + write uid_map / gid_map: Create a user namespace where you're root (UID 0 maps to your real UID). Inside, you can create other namespaces, mount filesystems, configure networking, all without real root. This is the entire foundation of rootless Podman and rootless Docker. The uid_map write must happen before any privilege-requiring operation in the child namespace.
- unshare(CLONE_NEWCGROUP | CLONE_NEWUSER | ...): Create new namespaces without forking. The calling process moves into a new user namespace, gains capabilities within it, and can then create further namespaces and mount filesystems. Two sharp edges: CLONE_NEWUSER fails with EINVAL in a multithreaded process, and you cannot setns() back into the parent user namespace afterwards. This is how rootless containers work (Podman, user-namespaced Docker).
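The user-namespace setup described above (unshare, then write the ID maps) can be sketched in C as below. It assumes unprivileged user namespaces are enabled on the host; many sandboxes and hardened distributions disable them, in which case the sketch reports "unavailable" rather than failing. The function names are hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int write_file(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, text, strlen(text));
    close(fd);
    return n == (ssize_t)strlen(text) ? 0 : -1;
}

/* Runs in a child process. Returns 0 if we became UID/GID 0 inside a new
 * user namespace, 1 if the host refused a step, 3 on an unexpected result. */
static int userns_become_root(void)
{
    uid_t uid = getuid(); /* capture outer IDs before unsharing */
    gid_t gid = getgid();
    char map[64];

    if (unshare(CLONE_NEWUSER) != 0)
        return 1;
    /* An unprivileged process must deny setgroups before writing gid_map. */
    if (write_file("/proc/self/setgroups", "deny") != 0)
        return 1;
    snprintf(map, sizeof map, "0 %u 1", (unsigned)uid);
    if (write_file("/proc/self/uid_map", map) != 0)
        return 1;
    snprintf(map, sizeof map, "0 %u 1", (unsigned)gid);
    if (write_file("/proc/self/gid_map", map) != 0)
        return 1;

    return (getuid() == 0 && getgid() == 0) ? 0 : 3;
}

/* Fork so the caller's own namespaces are untouched.
 * Returns the child's result, or -1 if fork/wait failed. */
int userns_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(userns_become_root());

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A rootless runtime would go on to unshare mount, PID, and network namespaces from inside the child once the maps are in place.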
Advanced Process & Thread Control
- clone3() with CLONE_CLEAR_SIGHAND (5.5+): Reset all signal handlers to their defaults atomically at fork (signals set to SIG_IGN stay ignored). Without this, you fork and have a window where the child inherits a SIGTERM handler that touches data structures it shouldn't. Container runtimes and process supervisors need this.
- prctl(PR_SET_CHILD_SUBREAPER): Make this process the reaper for all orphaned descendants, not just direct children. systemd, Docker, and any process supervisor use this so double-forked grandchildren don't escape to PID 1. Without it, you get zombie leaks in containers.
- prctl(PR_SET_PDEATHSIG): Send a signal when the parent dies. Sounds simple, but the subtlety is that it tracks the thread that called clone, not the process. If that thread exits but the process lives, your child gets a spurious death signal. Senior engineers know to call this right after fork and then verify that getppid() still returns the expected parent, because the parent may already have exited (race window).
- waitid(P_PIDFD, ...) (5.4+): Wait on a pidfd instead of a PID. Completely race-free: no TOCTOU between "is this process alive" and "wait for it." Combined with CLONE_PIDFD in clone3(), you get a fully race-free process lifecycle.
- prctl(PR_SET_SYSCALL_USER_DISPATCH) (5.11+): Redirect syscalls to a userspace handler. Syscalls issued from outside an allowed address range are not executed; the thread gets SIGSYS instead, and the handler emulates the call. Windows binary emulation (Wine/Proton) uses this: when Windows code makes a syscall, instead of going to Linux's syscall table, it gets dispatched to Wine's NT kernel emulation layer. Also used for syscall interposition without ptrace overhead.
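The PR_SET_PDEATHSIG race check described above can be written as a small C sketch. The helper names are hypothetical; the pattern is to record the parent's PID before fork, arm the signal in the child, and then confirm the parent has not changed.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Arm the parent-death signal, then verify the parent is still the one we
 * expect. Returns 0 if armed and the parent is alive, -1 otherwise. */
int arm_parent_death_signal(pid_t expected_parent, int sig)
{
    if (prctl(PR_SET_PDEATHSIG, (unsigned long)sig) != 0)
        return -1;
    /* If the parent exited before prctl() ran, we were already reparented
     * and the signal will never arrive: detect that here. */
    if (getppid() != expected_parent)
        return -1;
    return 0;
}

/* Fork a child whose lifetime is tied to the calling thread.
 * Returns the child's PID in the parent, or -1 on fork failure. */
pid_t spawn_tied_child(void (*child_main)(void))
{
    pid_t parent = getpid(); /* captured before fork, so it cannot race */
    pid_t pid = fork();
    if (pid == 0) {
        if (arm_parent_death_signal(parent, SIGTERM) != 0)
            _exit(1); /* parent already gone: do not run orphaned */
        child_main();
        _exit(0);
    }
    return pid;
}
```

Because the setting follows the thread that forked, supervisors usually call spawn_tied_child from a long-lived thread rather than a worker that may exit early.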
Signals & Error Handling
- signalfd() / signalfd4(): Convert signals into file descriptor events. Block the signals with sigprocmask() first, then read them via read() / epoll instead of async handlers. Eliminates the entire class of async-signal-safety bugs. Many well-written event loops (systemd's sd-event, for example) use this.
- prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0): Get SIGBUS as soon as the kernel detects an uncorrectable hardware (ECC) error in a page you have mapped, instead of only when you later touch that page. Database engines use this to detect hardware memory corruption and fail fast on the affected page rather than silently corrupting data.
Misc
- execveat(fd, "", argv, envp, AT_EMPTY_PATH): Execute a program by file descriptor instead of path. The binary can be unlinked (deleted); you're running directly from an open fd. Used for memfd-based execution: memfd_create → write binary to memfd → execveat(memfd_fd). No file ever touches disk. Container runtimes use this to avoid TOCTOU between security scanning a binary and executing it.
- copy_file_range() across filesystems: In-kernel copy that avoids the userspace read() + write() loop and lets the filesystem use reflinks or server-side copy. Before 5.3 it only worked within a single filesystem. 5.3 allowed arbitrary cross-filesystem copies, and since 5.19 cross-filesystem copies are limited to filesystems of the same type that implement it. For NFS→NFS copies, the server can do the copy without data touching the client at all.
- name_to_handle_at() + open_by_handle_at(): File handles survive rename/move. Get a handle (an opaque blob that typically encodes inode number and generation; the mount ID is returned alongside it), store it, and later open_by_handle_at() even if the file moved. Opening by handle requires CAP_DAC_READ_SEARCH. NFS servers use this as their "file ID." Also usable for a lightweight file-change detection system: if the inode was recycled, the generation no longer matches and the open fails with ESTALE.
- eventfd() + EFD_SEMAPHORE: Kernel-backed semaphore accessible via file descriptor. Writing an 8-byte value val increments the counter by val. read() decrements by 1 (with EFD_SEMAPHORE) or drains to 0 (without). Poll-able via epoll. Used for cross-process synchronization, VFIO interrupt delivery, and io_uring event notification. The kernel fast path is short, so the cost is dominated by syscall entry and exit.
- timerfd_create(CLOCK_BOOTTIME_ALARM): Timer that wakes the system from suspend (requires CAP_WAKE_ALARM). Your monitoring daemon can suspend the machine, and the timer fires → machine wakes → daemon runs health check → suspends again. Used in IoT and embedded systems for periodic sensor polling with minimal power.
- getrandom(buf, len, GRND_INSECURE) (5.6+): Fast random bytes without blocking, even if the entropy pool isn't fully initialized. Returns potentially predictable bytes during early boot. For non-cryptographic uses (hash seed randomization, load balancing), this is fine and avoids the "my service blocks for 30 seconds at boot waiting for entropy" problem that plagues headless VMs.
- ioctl(BTRFS_IOC_SEND): Stream a btrfs snapshot delta as a byte stream. Takes two snapshots, computes the diff, and writes a stream of commands (create file, write range, clone extent, set permissions). The receive side is implemented in userspace by btrfs receive, which replays the stream with ordinary syscalls and btrfs ioctls. Pipe it over SSH for incremental backup. Used by btrbk and every serious btrfs backup tool. The diff is computed from metadata trees, not by reading file contents: O(changes), not O(data).
- watch_mount() / watch_sb() (proposed, not merged): Subscribe to mount table and superblock changes via a pipe. Intended to replace polling /proc/mounts. Container runtimes currently poll() on /proc/mounts, which is O(n) in total mounts system-wide; on a host with 10K containers and 100K mounts, this is measurably expensive.
- close_range() (5.9+): Close a range of file descriptors in one syscall. Essential for secure process spawning; closing fds 3..MAX one-by-one was a measurable bottleneck in fork-heavy workloads.
- membarrier(): Issue a memory barrier on every thread of the process (or the whole system) from a single call. The rare slow path pays for the IPIs or grace-period wait, so the hot read side can use a plain compiler barrier instead of an expensive smp_mb(). Used by userspace RCU implementations (liburcu).
- landlock_create_ruleset() / landlock_add_rule() / landlock_restrict_self() (5.13+): Unprivileged sandboxing. A process can restrict its own filesystem access without root. The modern alternative to chroot/seccomp for filesystem isolation.
- mount_setattr() / open_tree() / move_mount() / fsopen() / fsmount(): The new mount API (5.2+, with mount_setattr() added in 5.12). Atomic, race-free, composable. Container runtimes are migrating from the old mount() syscall to these.
- io_uring + IORING_OP_URING_CMD: Pass custom commands through io_uring to drivers (NVMe passthrough, network zero-copy). Effectively makes io_uring an extensible syscall interface.
- listmount() / statmount() (6.8+): Iterate and query mount table entries by ID instead of parsing /proc/mounts. Direct lookup by mount ID vs O(n) text parsing. Container runtimes that manage thousands of mounts care about this.
- mseal() (6.10+): Seal memory mappings so they can't be changed (mprotect, munmap, mremap all fail). A mitigation against attacks that use ROP/JOP to mprotect a region from writable to executable or to remap it. Developed as part of Chrome's hardening work. Because a sealed mapping can never be unmapped, it suits long-lived regions such as program text and read-only data rather than short-lived JIT buffers.
- map_shadow_stack() (6.6+): Allocate a shadow stack for Intel CET (Control-flow Enforcement Technology). The CPU maintains a parallel return-address stack in write-protected memory. If a ROP gadget overwrites the real return address, it won't match the shadow stack and the CPU faults. Glibc 2.39 added support for enabling it on supported hardware for binaries marked as shadow-stack compatible.
- futex_waitv() (5.16+): Wait on multiple futexes simultaneously. Windows has WaitForMultipleObjects; Linux never had an equivalent until this. Wine/Proton needed this desperately for game compatibility, since many Windows games wait on multiple synchronization objects at once.
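The eventfd semaphore semantics listed above are easy to get wrong, so here is a minimal C sketch. The wrapper names (sem_fd_create, sem_fd_post, sem_fd_try_wait) are hypothetical; the descriptor is opened non-blocking so an empty semaphore reports "would block" instead of sleeping.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a counting semaphore backed by an eventfd. Returns fd or -1. */
int sem_fd_create(unsigned initial)
{
    return eventfd(initial, EFD_SEMAPHORE | EFD_NONBLOCK | EFD_CLOEXEC);
}

/* Add n units to the semaphore. Returns 0 on success, -1 on failure. */
int sem_fd_post(int fd, uint64_t n)
{
    /* eventfd requires exactly 8 bytes per write */
    return write(fd, &n, sizeof n) == (ssize_t)sizeof n ? 0 : -1;
}

/* Try to take one unit.
 * Returns 1 if acquired, 0 if the count is zero, -1 on error. */
int sem_fd_try_wait(int fd)
{
    uint64_t v;
    ssize_t r = read(fd, &v, sizeof v);
    if (r == (ssize_t)sizeof v)
        return 1; /* with EFD_SEMAPHORE the value read is always 1 */
    if (r < 0 && errno == EAGAIN)
        return 0;
    return -1;
}
```

Without EFD_SEMAPHORE the same read would return the whole accumulated count and reset it to zero, which is the behavior you want for "something happened" notifications rather than counted resources.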
Exotic / Niche
- kcmp(KCMP_EPOLL_TFD): Check if a specific target fd is monitored by a specific epoll instance, in another process. Checkpoint/restore (CRIU) needs this to reconstruct the exact epoll topology of a process tree.
- process_vm_readv() / process_vm_writev(): Read/write another process's memory in a single syscall (no ptrace attach/detach overhead, no /proc/pid/mem open). Debuggers and profilers use this for low-overhead memory inspection. Combined with pidfd_getfd(), you can inspect both a process's fds and its memory, though the target keeps running, so the view is not an atomic snapshot.
- timerfd_create() with TFD_TIMER_CANCEL_ON_SET: Get notified when the system clock is adjusted (NTP step, settimeofday). Applies to CLOCK_REALTIME timers armed with TFD_TIMER_ABSTIME; read() then fails with ECANCELED. Critical for applications that maintain time-based data structures: if the clock jumps forward 30 seconds, your timer wheel needs to know. Without this flag, your timers silently fire at the wrong time.
- fanotify with FAN_REPORT_PIDFD | FAN_REPORT_FID: Filesystem-wide event monitoring that gives you pidfds (race-free process identification) and file handles (not paths, so they survive renames). Used by malware scanners, HSM (hierarchical storage), and audit systems. The FAN_MODIFY + FAN_REPORT_FID combo lets you build a change journal like NTFS's USN journal on Linux.
- iopl() / ioperm(): Grant a userspace process direct access to x86 I/O ports. Used by DPDK's legacy mode to talk to NICs without kernel drivers. Extremely dangerous: you can brick hardware. Most people use UIO/VFIO instead now, but these still exist for legacy PCI devices.
Philosophy
The unifying theme across all of these: the default kernel behavior is optimized for the general case, and these syscalls exist to let you override that default when you know better. The kernel assumes you want page cache (O_DIRECT says you don't). The kernel assumes interrupt-driven I/O (SQPOLL/busy-poll says spin instead). The kernel assumes uniform memory (mbind says NUMA-aware). The kernel assumes you trust all your own threads (pkeys says isolate them).
Every one of these is a bet: "I know my workload better than the kernel's heuristics." The senior engineer's skill is knowing when that bet pays off and when the kernel's defaults are smarter than you think.
The real expertise isn't just knowing these exist — it's knowing when to reach for them. A senior engineer knows that MADV_HUGEPAGE can cause latency spikes from compaction, that MSG_ZEROCOPY has a crossover point below which the copy is faster, that SQPOLL burns a CPU but eliminates syscall overhead, and that rseq is useless if your workload already pins threads to cores.
Key References
- Kerrisk, The Linux Programming Interface, No Starch Press, 2010
- Love, Linux Kernel Development, 3rd ed., Addison-Wesley, 2010
- Gregg, Systems Performance, 2nd ed., Addison-Wesley, 2020
- Gregg, BPF Performance Tools, Addison-Wesley, 2019
- man 2 pages: the authoritative reference for each syscall
- lwn.net: in-depth coverage of every new kernel feature
- Linux kernel source Documentation/ tree, especially admin-guide/mm/, filesystems/, networking/, bpf/
- Axboe, "Efficient IO with io_uring", kernel.dk, 2019
- Corbet, "The rest of the 6.x merge window" series, LWN (for each kernel release)
- Facebook/Meta engineering blog — eBPF, sched_ext, PSI production usage
- Cloudflare blog — XDP, kernel bypass, DDoS mitigation architecture
- DPDK & SPDK documentation — VFIO, UIO, userspace driver patterns
See Also
- io_uring Internals — Deep dive into io_uring architecture referenced in the storage and profiling sections
- VFIO Internals — VFIO/IOMMU device passthrough referenced in the hardware access section
- ISA Critical Instructions — Hardware instructions (rseq, memory barriers, pkeys) underlying many syscalls listed here
- Filesystem Design — Filesystem internals behind O_DIRECT, fallocate, reflink, and sync_file_range