This document describes the Kerninst Application Programming Interface (KAPI), which allows a programmer to build tools for modifying an OS kernel on the fly. There are many reasons why you may want to do this, for example: performance profiling (inject timers at selected locations in the kernel), tracing (insert calls to tracing routines anywhere you want), dynamic code optimization (replace a function with a better-optimized version on the fly).
If you are looking for an in-depth description of the KAPI classes and methods, please refer to the Kerninst API Reference Guide. For installation information, refer to the 'INSTALL' file after downloading and unpacking either the source or binary distribution.
Sparc/Solaris |
IA-32/Linux |
PowerPC/Linux |
|
Processor Architecture: | Sun UltraSparc I, II, or III | Intel or AMD x86 | IBM PowerPC |
SMP Support: | Yes | Yes | Yes |
OS Version: | Sun Solaris 7 or 8 | Linux 2.4 or 2.6 | Linux 2.4 or 2.6 |
32-bit / 64-bit: | Yes / Yes | Yes / No | No / Yes |
A typical KAPI-based system is shown in Figure 1. The end user launches and interacts with an instrumentation tool written using C++ to instrument the kernel of a machine on the network. For the remainder of this document, we will refer to this tool as the mutator. The mutator translates end user's requests into calls to KAPI. The goal of this document is to describe how programmers can write their own mutators.
One of the key features of the KAPI is support for remote instrumentation. A mutator can run on one machine and instrument another machine's kernel over the network. To support this functionality, the system contains the third component, the Kerninst daemon, or kerninstd. Kerninstd listens for incoming network connections from mutators and performs their instrumentation requests. This structure allows the programmer to isolate the machine resources used by the mutator program from the kernel on the machine being instrumented.
Finally, to implement certain low-level operations that cannot be done from the user space, we install a small kernel driver on the machine being instrumented. This driver is associated with the /dev/kerninst device node to allow communication with kerninstd.
The KAPI is based on two abstractions, instrumentation points and snippets. Instrumentation points are places of interest within the kernel where instrumentation can be inserted. Instrumentation snippets represent some portion of executable code to be inserted at an instrumentation point. As an example, if we wished to know the number of times some kernel function was invoked, the instrumentation snippet would be an expression to increment a counter and the point would be the first instruction of the function.
In general, points can be any instruction in the kernel or loadable modules. However, the most common points are the entries and exits of functions and basic blocks. In the KAPI, the only information a programmer needs to define an instrumentation point is an address within the kernel.
Snippets can be used to represent most programming constructs,
including but not limited to conditionals, assignments, and function calls.
For a complete list of the types of snippets, refer to information on the
'kapi_snippet' class in the KAPI Reference Guide. Snippets are expressed in
Abstract Syntax Tree (AST) form. The code in AST form is best visualized as a
tree, with intermediate nodes representing the operators in the expression and
leaves representing the operands. The post-order traversal of the tree
(evaluate children, compute your own value from the children's value) will
produce the desired value of the entire expression. For example, an AST for
the expression 'A = ((B * C) + D) / (E - F)
' is shown in the
following figure.
In KAPI code, snippets (AST expressions) take the following form:
expression_type(operator_type, operand [, operand ...])
, where the operands are also snippets. Referring back to the previous
expression, the KAPI code to represent the expression would be as follows:
kapi_int_variable A(addr_of_A);
kapi_int_variable B(addr_of_B);
kapi_int_variable C(addr_of_C);
kapi_int_variable D(addr_of_D);
kapi_int_variable E(addr_of_E);
kapi_int_variable F(addr_of_F);
...
kapi_arith_expr assign_A(kapi_assign, A,
kapi_arith_expr(kapi_divide,
kapi_arith_expr(kapi_plus,
kapi_arith_expr(kapi_times, B, C),
D),
kapi_arith_expr(kapi_minus, E, F)));
Note that the above is just one way to code the assignment to A. Another one follows:
kapi_arith_expr B_times_C(kapi_times, B, C);
kapi_arith_expr dividend(kapi_plus, B_times_C, D);
kapi_arith_expr divisor(kapi_minus, E, F);
kapi_arith_expr assign_A(kapi_assign, A, kapi_arith_expr(kapi_divide, dividend, divisor));
The advantages of using declared variables for common expressions are obvious, reuse and less typing.
This section provides a brief overview of the programming interface by showing how the count_func example mutator is implemented. This mutator allows the user to monitor how often a particular kernel function is executed. It does this by injecting a small fragment of code at the entry point of the chosen kernel function to increment a counter every time the function is invoked. The implementation of this mutator follows a pattern representative of many other instrumentation tools. We summarize the main steps below.
The following is the source code listing for 'count_func.C', annotated with explanatory comments.
// count_func.C - mutator to count the number of entries to a user-specified
// kernel function
#include <iostream>
#include <time.h>
#include <stdlib.h>
#include <unistd.h>
// We must include the KAPI library header
#include "kapi.h"
using namespace std;
// Correct usage is: "count_func hostname port# module function"
// "hostname" - the name of the host where kerninstd is running
// "port#" - the listening port supplied by kerninstd
// "module" - the name of a kernel module
// "function" - the name of a function of interest within module
int usage(char *arg0)
{
cerr << "Usage: " << arg0 << " <hostname> <port#> <module> <function>\n";
return (-1);
}
// The global kapi_manager object, starting point for all KAPI interactions
kapi_manager kmgr;
// Useful function that detaches from kerninstd when an error is detected
// on return from some KAPI call
void error_exit(int retval)
{
kmgr.detach();
exit(retval);
}
int main(int argc, char **argv)
{
int rv = 0;
if (argc != 5) {
return usage(argv[0]);
}
// Grab the arguments
const char *host = argv[1];
int port = atoi(argv[2]);
const char *modName = argv[3];
const char *funcName = argv[4];
// Attach to kerninstd on the target host using the supplied port
if ((rv = kmgr.attach(host, port)) < 0) {
cerr << "attach error\n";
return rv;
}
// Find the module of interest, initialize kmod
kapi_module kmod;
if ((rv = kmgr.findModule(modName, &kmod)) < 0) {
cerr << "findModule error\n";
error_exit(rv);
}
// Find the function of interest, initialize kfunc
kapi_function kfunc;
if ((rv = kmod.findFunction(funcName, &kfunc)) < 0) {
cerr << "findFunction error\n";
error_exit(rv);
}
// Find the instrumentation point: the entry to the function
kapi_vector
As can be seen from the example in section 4, even simple expressions require a moderate amount of code to generate the corresponding KAPI snippet. A progammer may find it tedious to implement complex functions using only KAPI snippets. However, it is a very common occurence to require complex functions, as is the case for performing function replacement, where the programmer substitutes their own version of a kernel function in its place. As such, it is suggested that the programmer make use of kernel modules written in C for implementing complex functions. When there are only a couple such functions required, it is recommended to code the functions within the Kerninst driver source. If the programmer requires many complex functions, it is suggested that a new module be written (possibly modeled after the Kerninst driver source and build files) that contains the necessary functionality. After building the module and loading it into the kernel, the functions will be available to call from KAPI instrumentation snippets. Note that only loadable modules inserted before the kerninst driver will be available for instrumentation and analysis by KAPI, so any external modules written by the user should be loaded before the kerninst driver.
One of the most useful parts of the KAPI is the ability to read
the performance counters of the target machine's processor. The
kapi_hwcounter_expr
snippet is defined to represent a reading
of the processor counter associated with the kapi_hwcounter_kind
used to define the snippet. A popular kapi_hwcounter_kind
value is HWC_TICKS, which represents the processor tick counter,
a register that keeps track of the number of
processor cycles that have been executed since the machine was booted. To
implement a fairly accurate function timer, one just needs to read the
tick register at entry and exit to the function, calculate the difference,
and convert to a time value based on the speed of the processor. Unfortunately,
a number of issues complicate the use of such timers. For example, the simple
scheme given does not account for function recursion, nor is it safe for
use on multiprocessor machines. Perhaps even worse is that it does not
account for the possiblity that a process context-switch occurs between
the two tick readings. In order to deal with these issues, we have written
a KAPI timer library that provides support for two types of timers implemented
to safely handle the above problems.
The timer library, named 'kapi_timer_lib.h' and provided along with the
other example mutators, supports virtual timers
defined to represent any of the values of kapi_hwcounter_kind
.
A virtual timer measures hardware counter events occuring from timer
start to stop, excluding "time" switched-out due to process context-switches.
A wall timer measures the virtual time, and additionally adds time
spent switched-out for all threads of execution within the code. Thus, wall
timers provide an insight into the total work time of threads (thread-seconds,
analogous to man-hours) from timer start to stop. Wall timers are currently
only supported for HWC_TICKS.
Both types of timers are kept on a per-processor basis and are recursion-safe.
An overview of the interfaces to the timer library is given below.
ierr initialize_vtimer_environment(kapi_manager &the_mgr)
This function must be called before any of the others in the library, and is responsible for initializing the kernel environment and library state used to support timer management. The most important action of this function is to locate the context-switch handling routines in the kerninst driver and insert calls to these routines at the kernel context-switch points. The only argument is a reference to the global kapi_manager object. Like most other KAPI methods, the return value is zero on success, or a negative error code.
int add_vtimer(kapi_hwcounter_kind type,
const kapi_vector<kapi_point> &start_points,
const kapi_vector<kapi_point> &stop_points,
bool is_wall,
data_callback_t sample_routine = NULL,
unsigned sample_freq_ms = 1000)
This function can be used to add a timer of the specified type
.
When a timer is added, the following actions are taken: (1) memory to hold
the timer's per-processor values is allocated in the kernel, (2) timer start
and stop routines are placed at the requested points, and (3) sampling of
the timer's values is started (if requested).
The start_points
vector is the set of points at which to place
timer start code, and stop_points
similarly represents the points
at which to place stop code. The boolean is_wall
parameter, when
set to true, denotes that the timer should be a wall timer
instead of a virtual timer (NOTE: wall timers are currently only
supported for type HWC_TICKS). The last two parameters can be used to
specify the sampling behavior for this timer. If sample_routine
is set to NULL, no sampling will be done (although this pretty much makes
the timer useless). If you desire the timer's per-processor values to be
sampled, provide the name of an appropriate sample handling routine to be
invoked when samples of this timer arrive
(NOTE: invocation of the sample handler routine
is taken care of by the call to kapi_manager::handleEvents()
).
A default sample handling routine is defined by the library, and named
default_sample_callback
. The sample_freq_ms
parameter defines the desired sample frequency in milliseconds, and has
a default value of one second.
The function returns a timer handle to be used for later removing the timer,
or a negative error code.
void remove_vtimers(const kapi_vector<int> &vtimer_id_list)
This function can be used to deactivate the timers whose handles are
listed in the vtimer_id_list
parameter. Deactivation includes
removal of the timer start and stop code, stopping sampling for the timer,
and deallocating memory allocated to hold the timer values.
ierr cleanup_vtimer_environment(void)
This function must be called last and is responsible for cleaning up the kernel environment and library state used to support timer management. This function is responsible for deactivating any remaining active timers and removing the calls to the context-handling routines at the kernel context-switch points. Like most other KAPI methods, the return value is zero on success, or a negative error code.
For an in-depth example of how to use the KAPI timer library, see the example mutator named 'time_func.C', which shows how to use a timer on a specified kernel function. Also, there is an example mutator for each supported platform, named '<processor-type>_perfctr_example.C', that shows how to use the timer library for monitoring specific processor performance counters.
By default, all instrumentation snippets inserted into the kernel will
be active for all threads in execution at the instrumentation points. The
KAPI does provide a mechanism for performing instrumentation only when
a specified process (or list of processes) is executing at the point. This
is accomplished using the kapi_pid_expr, which evaluates to the process id
for the current thread of execution at the point. An example mutator
named 'pid_predicate.C' is provided that
performs the same function as the count_func example described in
section 5, but adds predication on a list
of process id's supplied on the command line. Here we show the interesting
portion of this example mutator's code:
// Generate code to increment the variable if process id // in pid list kapi_pid_expr pid_expr; kapi_bool_expr check_pids(kapi_eq, pid_expr, kapi_const_expr(pid_list[0])); for(unsigned i = 1; i < num_pids; i++) { check_pids = kapi_bool_expr(kapi_log_or, check_pids, kapi_bool_expr(kapi_eq, pid_expr, kapi_const_expr(pid_list[i]))); } kapi_arith_expr increment(kapi_assign, intCounter, kapi_arith_expr(kapi_plus, intCounter, kapi_const_expr(1))); kapi_if_expr code(check_pids, increment);
pid_expr
evaluates to one of the supplied pids in the pid_list
array,
and if so execute the increment of the variable intCounter
.
Function replacement is the ability for a programmer to substitute an alternate implementation of a kernel function for the default implemenatation. There are two independent methods provided by the KAPI to perform function replacement. Both methods are implemented to assume that the arguments to the original and replacement implementation of the function being replaced are identical. It is suggested that replacement versions of kernel functions be implemented in a kernel module rather than via KAPI snippets to assure correctness with respect to calling conventions and kernel data types.
The first method provided by KAPI, which we refer to as true function replacement, instruments the original function at its very first instruction to place an unconditional control transfer to the replacement. After the instrumentation has been performed, all calls made from anywhere in the kernel to the original function will be redirected to the replacement function. See the example mutator 'replace_function.C' for an example of how to do true function replacement.
The second method provided by KAPI, which we refer to as callsite function replacement, instruments a single callsite to replace the destination address of the call with the entry address of the replacement function. After the instrumentation has been performed, any time the modified callsite is executed, the replacement function will be called instead of the original. See the example mutator 'replace_function_call.C' for an example of how to do callsite function replacement.
The code analysis framework of KAPI allows the programmer to walk the
static call graph of the kernel by starting at a top-level function,
descending into its callees, and so on. See the documentation for
kapi_function::getCallees()
in the KAPI Reference Guide for
details. Unfortunately, kernel code is full of indirect calls, where
targets are not known in advance. As a result, the static approach may
not be able to find all callees of some kernel function. Fortunately, the
KAPI provides primitives for discovering indirect callees at run time by
instrumenting the corresponding callsites and snooping on destination
addresses as the calls are made. The functions provided by KAPI for this
purpose are documented in the 'Handling Indirect Calls' subsection of
the class kapi_manager
description in the KAPI Reference Guide.
An example mutator named 'watch_callees.C' is also provided as a point of
reference for performing callee watching.