last edited: 2024-12-20 23:08:46 +0000
Trace-based Debugging
Introduction
The simplest method of debugging is to have gem5 print out traces of what it’s
doing. The simulator contains many DPRINTF statements that print trace messages
describing potentially interesting events. Each DPRINTF is associated with a
debug flag (e.g., Bus
, Cache
, Ethernet
, Disk
, etc.). To turn on the
messages for a particular flag, use the --debug-flags
command line argument.
Multiple flags can be specified by giving a list of strings, e.g.:
build/<ISA>/gem5.opt --debug-flags=Bus,Cache configs/examples/fs.py
would turn on a group of debug flags related to instruction execution but leave out Tick (timing) information. This is useful if you want to compare execution between two runs where the same instructions execute but at different rates.
Note that the gem5.fast binary does not support tracing; part of what makes it faster than gem5.opt is that the DPRINTF code is compiled out.
The --debug-flags
command line option should come after the gem5 executable
but before the simulation script. This is because debug flags are handled by
gem5 itself, and whether command line options are before or after the
simulation script determine if they’re for gem5 or the script.
Debugging Options
-----------------
--debug-break=TIME[,TIME]
Tick to create a breakpoint
--debug-help Print help on debug flags
--debug-flags=FLAG[,FLAG]
Sets the flags for debug output (-FLAG disables a
flag)
--debug-start=TIME Start debug output at TIME (must be in ticks)
--debug-file=FILE Sets the output file for debug [Default: cout]
--debug-ignore=EXPR Ignore EXPR sim objects
The complete list of debug/trace flags can be seen by running gem5 with the
--debug-help
option.
If you find that events of interest are not being traced, feel free to add
DPRINTFs yourself. You can add new debug flags simply by adding DebugFlag()
command to any SConscript file (preferably the one nearest where you are using
the new flag). If you use a debug flag in a C++ source file, you would need to
include the header file debug/<name of debug flag>.hh
in that file.
For more complex bugs, the trace can be useful in simply identifying points in
the simulation where more in-depth investigation is needed. The --debug-break
option lets you re-run your simulation under a debugger and stop on a
particular tick as identified by the trace. You can also schedule breakpoints
and enable or disable debug flags from within the debugger itself. See the page
on Debugger Based Debugging for more information.
The Exec debug flag
The Exec
compound debug flag is very useful because it turns on instruction
tracing in gem5. It makes the simulator print a disassembled version of each
instruction as it finishes executing, along with other useful information like
the time, pc, the address if it was a memory instruction, etc. These individual
pieces of information can be turned on and off with the base debug flags Exec
controls. For example, you can disable the use of function symbol names in
place of absolute PC addresses (if they’re available) by turning off the
ExecSymbol flag (e.g., --debug-flags=Exec,-ExecSymbol
).
If some supposedly innocuous change has caused gem5 to stop working correctly,
you can compare trace outputs from before and after the change using the
tracediff script in the src/util
directory. Comments in the script describe
how to use it.
Reducing trace file size
Trace file can become very large very quickly, but they also compress very well
(e.g. about 90%). If you’d like to make gem5 output a compressed trace, just
add a .gz
extension to the output file name. For example
--debug-file=trace.out
will produce an uncompressed file as normal, but
--debug-file=trace.out.gz
will produce a gzip compressed file. You can use
the zcat program and pipes to process the output. The editor vim also can
uncompress gzip compressed files in memory.
The tracediff and rundiff utilities
tracediff
and rundiff
utilities allow the simple diffing of two streams of
trace data from gem5 to find any differences. It’s very handy for debugging why
regression tests fail, figuring out why your minor code change seems to cause
some unrelated execution problem, or comparing the execution of CPU models.
Both utilities are found in the util
directory. rundiff
is a simple
diff-like program. Unlike regular diff, this script does not read in the entire
input before comparing its inputs, so it can be used on lengthy outputs piped
from other programs (e.g., gem5 traces). tracediff
is a front end for
rundiff
that provides an easy way to run two similar copies of gem5 and diff
their outputs. It takes a common gem5 command line with embedded alternatives
and executes the two alternative commands in separate subdirectories with
output piped to rundiff.
Script arguments are handled uniformly as follows:
- If the argument does not contain a ‘|’ character, it is appended to both command lines.
- If the argument has a ‘|” character in it, the text on either size of the ‘|’ is appended to the respective command lines. Note that you’ll have to quote the arg or escape the ‘|’ with a backslash so that the shell doesn’t thing you’re doing a pipe or put quotes around it.
- Arguments with ‘#’ characters are split at those characters, processed for alternatives (‘|’s) as independent terms, then pasted back into a single argument (without the ‘#’s). (Sort of inspired by the C preprocessor ‘##’ token pasting operator.)
In other words, the arguments should look like the command line you want to run, with ‘|’ used to list the alternatives for the parts that you want to differ between the two runs.
For example:
% tracediff gem5.opt --opt1 '--opt2|--opt3' --opt4
# would compare these two runs:
gem5.opt --opt1 --opt2 --opt4
gem5.opt --opt1 --opt3 --opt4
% tracediff 'path1|path2#/m5.opt' --opt1 --opt2
# would compare these two runs:
path1/gem5.opt --opt1 --opt2
path2/gem5.opt --opt1 --opt2
If you want to add arguments to one run only, just put a ‘|’ in with text only
on one side (--onlyOn1|
). You can do this with multiple arguments together
too (|-a -b -c
adds three args to the second run only).
The -n
argument to tracediff allows you to preview the two generated command
lines without running them.
For tracediff to be useful some trace flags must be enabled. The most common
trace flags to use with tracediff are --debug-flags=Exec,-ExecTicks
which
removes the timestamp from each trace making it suitable to diff when slight
timing variations are present.
Tracediff is also useful for comparing CPU models when one fails and the other
doesn’t. In this case it’s best to create a checkpoint before the problem
occurs (this can be done by just creating a bunch of checkpoints and finding
one that fails). If the failure occurs in kernel code, use the
-ExecUser
debug flag, on the other hand if it occurs in user code try the
-ExecKernel
debug flag to isolate user code in the trace. You can then
compare the traces and see when the execution diverges.
Comparing traces across machines
Sometimes gem5 executions differ inexplicably across different environments, and you’d like to use rundiff to help pinpoint where they diverge. Rather than try and reproduce those environments on the same machine, you can use netcat with rundiff to compare traces from gem5 instances running on separate systems across the network.
First, start rundiff running on one machine, configured to compare the trace output from a local instance of gem5 with the output of a netcat “server”. Since the network is likely to be the bottleneck, we’ll compress the trace going across netcat, which means we need to uncompress it as it arrives. For example (choosing port number 33335 arbitrarily):
util/rundiff 'gem5.opt --debug-flag=Exec <gem5 args> |' 'nc -d -l 33335 | gunzip -c |' >& tracediff.out &
Now go to the second machine, start a copy of gem5 there, and ship its compressed trace output to the netcat instance running on the first machine. For example:
gem5.opt --debug-flag=Exec <gem5 args> |& gzip -c |& nc <hostname> 33335
Internal Exec tracing implementation (InstTracer)
The “Trace-based debugging” section above talked about how to use the Exec
trace flag to print information about each instruction as it completes. That
functionality is actually implemented by an InstTracer
object which collects
information about instructions as the execute. These objects can be swapped
out, and different objects can do different things with the information they
collect. For instance, the IntelTrace
object prints out a trace in a
different format which is compatible with an external tool. The objects can
also do more than just print a trace. NativeTrace
objects send information
about architectural state over a socket to the statetrace tool (described
below) instruction by instruction to validate execution. InstTracer
objects
are SimObjects
which are assigned to the tracer
parameter of each CPU. If
you want to install a different tracer, just assign it to that parameter on the
CPU of interest.
When writing your own InstTracer
, you’ll write at least two different
classes, one which inherits from InstTracer
and one that inherits from
InstRecord
. The InstTracer
class’s main responsibility is to generate
InstRecord
objects which are associated with a particular instruction. By
subclassing InstTracer
, you’ll be able to return your own specialized version
of InstRecord
which is the class that really does most of the work.
The InstRecord
class have a number of fields which hold information about the
history of an instruction. For instance, InstRecord
records the instruction’s
PC, what address it used if it accessed memory, a “data” value which it
produced (multiple data values aren’t handled), etc. The InstRecord
function
also has a pointer to a ThreadContext
which can be used to read out
architectural state. When an instruction is finished executing, the
InstRecord
’s dump()
virtual function is called to process the record. For
the default InstTracer
, this is where the instruction’s assembly language
form, etc., is printed which is the output you see when you turn on Exec
. For
NativeTrace
, this is where architectural state is gathered up to send to
statetrace.
Disassembling instructions with third party disassembler
Most of the gem5 tracers (inheriting from InstTracer mentioned above) will print/dump the dynamic instruction stream together with other information (e.g. destination register values). The disassembly is generated on the fly by querying the instruction to be traced (StaticInst). Every StaticInst is supposed to define a generateDisassembly method which returns the instruction mnemonic (opcode + operand list) as a string.
From gem5 v23.1 it will be possible to hook different disassemblers to every InstTracer. A disassembler will have to implement the InstDisassembler interface defined in src/sim/insttracer.hh
By default the native disassembler (relying on generateDisassembly) will be used. To change the disassembler with a custom one (say it is called MyDisassembler), just amend the config file with:
cpu.tracer.disassembler = MyDisassembler()
Capstone disassembler
gem5 v23.1 introduces the integration with the Capstone disassembler. Capstone is an open source disassembler already used by other projects (like QEMU).
To compile gem5 with capstone support, it is necessary to install capstone first. Then the capstone disassembler will have to be instantiated in the config script. At the time of the writing only the Arm version of the disassembler has been implemented. Therefore the line to be added to the script will have to be (assuming it is an Arm simulation):
cpu.tracer.disassembler = ArmCapstoneDisassembler()
Comparing traces with a real machine
The statetrace tool runs alongside gem5 and compares execution of a workload on a real machine with execution in gem5. In the simulator and the real system, the workload is allowed to run one instruction at a time. After each instruction, architectural state is collected and compared and any differences are reported. It can be tricky to get it set up and producing useful results (described below), but it’s an extremely valuable tool for debugging because it tends to quickly pinpoint exactly where a problem is coming from, likely saving many hours of painful debugging per bug.
Native Trace
In gem5, a NativeTrace InstTracer
object (described above) needs to be
installed on the CPU that will run the workload of interest. When execution
starts, the tracer will wait for the state trace utility to connect to it.
Then, after each instruction executes, it uses the ThreadContext
pointer in
the InstRecord
object to gather architectural state from the currently
running process. It also reads in architectural state gather by state trace
through the connection they established. The two versions of state are
compared, and any meaningful differences are reported. The exact makeup of the
state and how it should be compared is very ISA dependent on ISA, so each ISA
defines its own version of NativeTrace. These specialized classes can handle
things like expected differences when registers may become undefined, or
situations where execution skips ahead for one reason or another.
statetrace utility
The statetrace utility is found in the util directory and is responsible for
running the workload on the real machine. It uses the ptrace mechanism provided
by the Linux kernel to single step the target process and to access its state.
It uses scons, but is independent of scons as used by the rest of gem5. To
build a version of statetrace suitable for a particular ISA, use the
build/${ARCH}/statetrace
target where ${ARCH}
is replaced by the ISA of
interest. Currently recognized values for ${ARCH}
are amd64
, arm
, i686
,
and sparc
. You can override the compiler used for any ISA using the CXX scons
argument, and the compiler used for a particular ISA with ${ARCH}CXX
. For
instance, to build an arm version of statetrace, you could run:
cd util/statetrace
scons ARMCXX=arm-softfloat-linux-gnueabi-g++ build/arm/statetrace
statetrace accepts four flags, -h
to print the help, --host
to specify what
ip and port gem5 is listening at, -i
to print out what’s on the initial stack
frame, and -nt
to disable tracing. -nt
is typically used with -i
to get
information about a processes initial stack without running it. The end of the
command line options is marked with two dashes. Next, put the command line you
want statetrace to run.
The exact text of the program name and arguments matters because these will be passed to the process on its stack. Longer values take up more room on the stack, that displaces other items to different addresses, and statetrace clog up with lots of unimportant differences. For instance, if you need to run a program found in your home directory in a gem5 subdirectory and you run this command:
statetrace -- ~/gem5/my_benchmark arg1 arg2
You must also override arg0 in gem5 to be ~/gem5/my_benchmark
.
Tuning
statetrace is a very sensitive system, and any minor difference between simulated execution and real execution could produce lots and lots of spurious differences. In order to get useful information from statetrace you’ll need to adjust the real system and gem5 so that everything lines up perfectly. I normally create a patch which has all the modifications I’ve made to gem5 for statetrace. Then I can easily remove them or reapply them for as I find and fix problems. Mercurial queues is useful for managing that patch and patches for my fixes. The following is an incomplete list of the differences you may have to correct.
Address randomization: To improve security, Linux will randomize the address
space of processes, moving around their stack and heap areas. This makes it
harder for an attacker to predict what memory will look like, but it also
thoroughly defeats statetrace. To disable it, echo 0
into
/proc/sys/kernel/randomize_va_space
. You’ll almost certainly need root
permissions to do that.
argv values: Be sure to use exactly the same text for each argument to your program in gem5 and on the real system. This includes arg0, the program name.
File block size: Glibc uses the block size associated with a file to decide how
to buffer it. Different behavior will throw off execution and prevent
statetrace from working. You can change the block size gem5 reports in the
convertStatBuf
and convertStat64Buf
functions in src/sim/syscall_emul.hh
.
Initial stack contents: Depending on your version of Linux, the contents of the
initial stack may be different. You can use the -i
and -nt
options to print
out the content of the initial stack on the real machine. statetrace attempts
to interpret the initial stack so you can more easily see what’s on it. You’ll
need to adjust how gem5 sets up the stack to match your real system. This code
is typically in a file called process.cc
in the appropriate arch directory.
gem5’s code has been painstakingly constructed so that it sets up a stack as
identically to Linux as possible, but the underlying mechanism would change.
Also, Linux puts a collection of auxiliary vectors on the initial stack. These
are type, value pairs which let the kernel provide extra information to the
process as it starts. From time to time Linux introduces a new type of
auxiliary vector and adds it to the stack. You may need to dig into the Linux
source and emulate any new entries.
Caveats
Because statetrace is very sensitive to any changes in execution, it can’t be
used with programs that don’t behave in very predictable ways. For instance, if
a program reads in a random value from /dev/random
and uses that in a
calculation (or worse in control flow) then that program can’t be used. Less
obviously, if the program relies on the system time which is unpredictable, it
also can’t be used. Generally speaking, many benchmarks try to be very
deterministic so that they can be used to generate reproducible data. That
makes them work well with statetrace.
Statetrace can’t be used at the operating system level for at least two main
reasons. First, no system is implemented or will be implemented in the
foreseeable future for single stepping an operating system. Second, real
operating systems are not determinstic. Interrupts from hardware devices will
almost certainly come in at unpredictable times, some devices will return
unpredictable data, and gem5 is much less likely to exactly match the behavior
of a system at that level where firmware and other implementation details are
non longer abstracted away. Second the amount of state that’s relevant at the
system level is typically larger than at the user level, especially in complex
ISAs like x86
. Gathering, comparing, and transporting all that extra state
would significantly impact performance.
Not all implementations of ptrace actually work properly. For instance when I
last used statetrace with ARM
, certain functions called into a region of
memory set up by the kernel which had kernel specific implementations of for
various operations. Ptrace relied on software breakpoints which work by
replacing the next instruction in the program with one that will trap. Because
the region of memory really belonged to the kernel, ptrace couldn’t modify it
to install a breakpoint. The process “escaped” single stepped execution and
quickly ran to completion, leaving gem5 waiting for an update that never came.
statetrace isn’t able to track changes to memory. Because memory is very large and there isn’t a convenient way to detect modifications to it, statetrace only tracks register based architectural state. If an instruction changes registers correctly but stores the wrong value to memory and/or to the wrong address, that problem may not be detected for many instructions. Fortunately, those sorts of errors are the exception.
To compare execution to a real machine, you ideally need to have a real machine at your disposal. It’s still quite possible, however, to run statetrace inside an emulator like qemu. That’s likely a little slower and compares execution against the emulator and not real hardware, but it can still help identify bugs.
ISA support
Currently SPARC
, ARM
, and x86
support state. ARM’s support is currently
the most sophisticated, only sending differences in state across the connection
which improves performance, and only printing when differences start or stop
which reduces output and improves readability. Those features are planned to be
ported to the other ISAs. Hopefully that code can be factored out and put into
the base NativeTrace
class so that all ISAs can use it easily.