This is the parser for the instruction partitioning data (IPD) files.
These files are text-based descriptions of functions and basic blocks. They let the user seed the partitioner with information that it cannot otherwise obtain.
For instance, the analyst may know that a function begins at a certain virtual address, but for some reason the partitioner does not discover this address in its normal mode of operation. The analyst can create an IPD file that describes the function so that the partitioner finds it.
An IPD file is able to:
- specify an entry address of a function that is otherwise not detected.
- give a name to a function that doesn't have one.
- specify whether the function ever returns to the caller.
- list additional basic blocks that appear in the function.
- specify the address of a basic block that is otherwise not detected.
- indicate that a basic block is semantically equivalent to another basic block.
- override the control-flow successors for a basic block.
The language non-terminals are:
File := Declaration+
Declaration := FuncDecl | BlockDecl
FuncDecl := 'function' Address [Name] [FuncBody]
FuncBody := '{' FuncStmtList '}'
FuncStmtList := FuncStmt [';' FuncStmtList]
FuncStmt := ( Empty | BlockDecl | ReturnSpec )
ReturnSpec := 'return' | 'returns' | 'noreturn'
BlockDecl := 'block' Address Integer [BlockBody]
BlockBody := '{' BlockStmtList '}'
BlockStmtList := BlockStmt [';' BlockStmtList]
BlockStmt := ( Empty | Alias | Successors ) ';'
Alias := 'alias' Address
Successors := ('successor' | 'successors') [SuccessorAddrList|AssemblyCode]
SuccessorAddrList := '{' (AddressList | AddressList '...' | '...') '}'
AddressList := Address ( ',' AddressList )*
Address: Integer
Integer: DECIMAL_INTEGER | OCTAL_INTEGER | HEXADECIMAL_INTEGER
Name: STRING
AssemblyCode: asm '{' ASSEMBLY '}'
Language terminals:
HEXADECIMAL_INTEGER: as in C, for example, 0x08045fe2
OCTAL_INTEGER: as in C, for example, 0775
DECIMAL_INTEGER: as in C, for example, 1234
STRING: double quoted. Use backslash to escape embedded double quotes
ASSEMBLY: x86 assembly instructions (must contain balanced curly braces, if any)
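Because the integer terminals follow C conventions, a minimal sketch of reading one could lean on the C library's base-detecting conversion. The helper name parseIpdInteger is hypothetical and is not part of ROSE; the actual parser may scan integers differently.

```cpp
#include <cctype>
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical helper (not ROSE API): parse a C-style integer literal.
// Base 0 lets strtoull pick hexadecimal ("0x" prefix), octal (leading "0"),
// or decimal, which matches the three integer terminals.
bool parseIpdInteger(const std::string &text, std::uint64_t &value) {
    // Require a leading digit so that signs and leading whitespace are rejected.
    if (text.empty() || !std::isdigit(static_cast<unsigned char>(text[0])))
        return false;
    errno = 0;
    char *end = nullptr;
    const unsigned long long parsed = std::strtoull(text.c_str(), &end, 0);
    if (errno != 0 || end != text.c_str() + text.size())
        return false;                   // overflow or trailing characters
    value = parsed;
    return true;
}
```

With this sketch, "0775" is read as octal and "0x08045fe2" as hexadecimal, while a token such as "12zz" is rejected.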
Comments begin with a hash ('#') and continue to the end of the line. The hash character is not treated specially inside quoted strings. Comments within an ASSEMBLY terminal must conform to the syntax accepted by the Netwide Assembler (nasm), namely semicolon in place of a hash.
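The comment rule for ordinary (non-ASSEMBLY) text can be illustrated with a small line-based sketch. The function name stripIpdComment is hypothetical, and the sketch deliberately ignores ASSEMBLY terminals, whose comments follow nasm syntax instead.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not ROSE API): drop a '#' comment from one line of
// IPD text. A '#' inside a double-quoted string is kept, and a backslash
// inside a string escapes the next character (for example an embedded quote).
std::string stripIpdComment(const std::string &line) {
    bool inString = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (inString) {
            if (c == '\\' && i + 1 < line.size())
                ++i;                    // skip the escaped character
            else if (c == '"')
                inString = false;       // closing quote
        } else if (c == '"') {
            inString = true;            // opening quote
        } else if (c == '#') {
            return line.substr(0, i);   // comment runs to the end of the line
        }
    }
    return line;                        // no comment found
}
```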
Semantics
A block declaration specifies the virtual memory address of the block's first instruction. The integer after the address specifies the number of instructions in the block. If the specified length is less than the number of instructions that ROSE would otherwise place in the block at that address, then ROSE will create a block of exactly the specified size. Likewise, if the specified address is midway into a block that ROSE would otherwise create, ROSE will create a block at the specified address anyway, causing the previous instructions to be in a separate block (or blocks). If the specified block size is larger than what ROSE would otherwise place in the block, the block will be created with fewer instructions but the BlockBody will be ignored.
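The size rules above can be summarized in a short sketch. The type and function names are hypothetical, and the treatment of an exact match (declared size equal to the discovered size) is an assumption, since the text only describes the smaller and larger cases.

```cpp
#include <cstddef>

// Hypothetical summary type (not ROSE API).
struct BlockSizeDecision {
    std::size_t instructionCount;   // instructions the resulting block contains
    bool bodyHonored;               // false when the BlockBody is ignored
};

// discovered: number of instructions ROSE would otherwise place in the block
//             starting at the declared address.
// declared:   the Integer given in the block declaration.
BlockSizeDecision resolveBlockSize(std::size_t discovered, std::size_t declared) {
    if (declared > discovered)
        return {discovered, false};     // block is shorter than declared; body ignored
    // Declared size is smaller (block is truncated to exactly that size) or,
    // by assumption, equal to the discovered size.
    return {declared, true};
}
```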
A function declaration specifies the virtual memory address of the entry point of a function. The body may specify whether the function returns. As of this writing [2010-05-13] a function declared as non-returning will be marked as returning if ROSE discovers that a basic block of the function returns.
If a block declaration appears inside a function declaration, then ROSE will assign the block to the function.
The block 'alias' attribute is used to indicate that two basic blocks perform the exact same operation. The specified address is the address of the basic block to use instead of this basic block. All control-flow edges pointing to this block will be rewritten to point to the specified address instead.
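The edge rewriting described above can be pictured with a toy control-flow graph. This is only an illustrative model under the assumption that the graph is a map from block address to successor set; the Cfg type and applyAlias function are hypothetical and do not reflect how ROSE stores its graph.

```cpp
#include <cstdint>
#include <map>
#include <set>

typedef std::uint64_t Address;
typedef std::map<Address, std::set<Address> > Cfg;   // block address -> successors

// Hypothetical model (not ROSE API): `aliased` is the block that carries the
// 'alias' attribute and `replacement` is the address named by that attribute.
// Every edge that points at `aliased` is redirected to `replacement`.
void applyAlias(Cfg &cfg, Address aliased, Address replacement) {
    for (Cfg::iterator it = cfg.begin(); it != cfg.end(); ++it) {
        std::set<Address> &successors = it->second;
        if (successors.erase(aliased) > 0)
            successors.insert(replacement);
    }
}
```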
Example file:
function 0x805116 "func11" {          # declare a new function named "func11"
    returns;                          # this function returns to callers
    block 0x805116 {                  # block at 0x805116 is part of func11
        alias 0x8052116, 0x8052126    # use block 0x805116 in place of 0x8052116 and 0x8052126
    }
}
Basic Block Successors
A block declaration can specify control-flow successors in two ways: as a list of addresses, or as an x86 assembly language program that is interpreted by ROSE. The benefit of using a program to determine the successors is that the program can directly extract information, such as jump tables, from the specimen executable.
The assembly source code is fed to the Netwide Assembler, nasm (http://www.nasm.us/), which assembles it into i386 machine code. When ROSE needs to figure out the successors for a basic block it will interpret the basic block, then load the successor program and interpret it, then extract the successor list from the program's return value. ROSE interprets the program rather than running it directly so that the program can operate on unknown, symbolic data values rather than actual 32-bit numbers.
The successor program is interpreted in a context that makes it appear to have been called (via a CALL instruction) from the end of the basic block being analyzed. These arguments are passed to the program:
- The address of an "svec" object to be filled in by the program. The first four-byte word at this address is the number of successor addresses that immediately follow, and it must be a known value when the program returns. The following words are the successors, each of which may be a known or an unknown value.
- The size of the "svec" object in bytes. The object is allocated by ROSE and has a fixed size (8192 bytes at the time of this writing, enough to hold 2047 successors).
- The starting virtual address of the first instruction of the basic block.
- The address immediately after the last instruction of the basic block. Depending on the Partitioner settings, a basic block may or may not be contiguous in memory.
- The value of the stack pointer at the end of the basic block. ROSE creates a new stack before starting the successor program because the basic block's stack might not be at a known memory address.
The successor program may either fall off the end or execute a RET instruction.
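To make the "svec" layout concrete, the following sketch decodes such a buffer on the host side. It assumes little-endian 32-bit words (as on i386) and models only known values; unknown (symbolic) successors are outside its scope. The names kSvecBytes, kSvecCapacity, and decodeSvec are hypothetical and are not ROSE identifiers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sizes taken from the description above; the names are hypothetical.
const std::size_t kSvecBytes = 8192;                    // fixed svec size in bytes
const std::size_t kSvecCapacity = kSvecBytes / 4 - 1;   // words left after the count word

// Read one little-endian 32-bit word, as an i386 program would have stored it.
static std::uint32_t readWord(const std::uint8_t *p) {
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Decode an svec: word 0 is the successor count and words 1..count are the
// successor addresses. Returns false if the count cannot fit in the buffer.
bool decodeSvec(const std::uint8_t *svec, std::size_t svecBytes,
                std::vector<std::uint32_t> &successors) {
    successors.clear();
    if (svecBytes < 4)
        return false;
    const std::uint32_t count = readWord(svec);
    if (count > svecBytes / 4 - 1)
        return false;
    for (std::uint32_t i = 0; i < count; ++i)
        successors.push_back(readWord(svec + 4 * (static_cast<std::size_t>(i) + 1)));
    return true;
}
```

Note that the successors start at byte offset 4, so a program that fills the vector must not store its first successor over the count word.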
For instance, if the 5-instruction block at virtual address 0x00c01115 ends with an indirect jump through a 256-element jump table beginning at 0x00c037fa, then a program to compute the successors might look like this:
block 0x00c01115 5 {
    successors asm {
        push ebp
        mov ebp, esp
        ; ecx is the base address of the successors return vector,
        ; the first element of which is the vector size.
        mov ecx, [ebp+8]
        add ecx, 4          ; skip the size word; ecx now points at the first successor slot
        ; loop over the entries in the jump table, copying each
        ; address from the jump table to the svec return value
        xor eax, eax
    copy_entry:
        cmp eax, 256
        je done
        mov ebx, [0x00c037fa+eax*4]
        mov [ecx+eax*4], ebx
        inc eax
        jmp copy_entry
    done:
        ; set the number of entries in the svec
        mov ecx, [ebp+8]
        mov DWORD [ecx], 256
        mov esp, ebp
        pop ebp
        ret
    }
}
Example Programmatic Usage
The easiest way to parse an IPD file is to read it into memory and then call the parse() method. The following code demonstrates the use of mmap to read the file into memory, parse it, and release it from memory. For simplicity, we do not check for errors in this example.
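#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat

int fd = open("test.ipd", O_RDONLY);
struct stat sb;
fstat(fd, &sb);
const char *content = (char*)mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
// Construct the IPD parser over the sb.st_size bytes at content and call its parse() method here (see Partitioner.h).
munmap(const_cast<char*>(content), sb.st_size);   // munmap takes a non-const pointer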
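Where mmap is unavailable or inconvenient, the file can instead be read into a string with standard C++ streams. This is a minimal sketch with basic error checking; the function name readIpdFile is hypothetical, and it is an assumption that the resulting buffer and its length can be handed to the parser in the same way as the mapped region.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not ROSE API): read a whole IPD file into memory.
// Throws std::runtime_error if the file cannot be opened.
std::string readIpdFile(const std::string &path) {
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);
    std::ostringstream buffer;
    buffer << in.rdbuf();               // copy the entire stream into the buffer
    return buffer.str();
}
```

The string owns its storage, so no explicit release step corresponding to munmap is needed.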
Definition at line 1750 of file Partitioner.h.