Malware analysts spend appreciable time inspecting the situations beneath which malicious applications will take sure actions. For instance, take into account a malware program that incorporates a examine for the presence of a debugger, a typical approach meant to hinder evaluation. The analyst might want to know if there’s a viable execution path that circumvents this examine and, if there’s a path, what inputs and environmental situations are wanted to traverse it. We name this sort of reasoning path discovering. The paths to search out could be primarily based on quite a few standards, corresponding to user-specified begin and finish addresses or passing by particular program factors. CERT’s reverse engineers and malware analysts have discovered path discovering is beneficial when analyzing malware.
In a earlier submit, we mentioned Kaiju, the CERT/CC‘s extension framework to Ghidra, the Nationwide Safety Company’s software program reverse engineering (SRE) instrument suite. Kaiju consists of many instruments to help malware evaluation and reverse engineering. One of many extra advanced plugins included in Kaiju is a Satisfiability Modulo Theories (SMT) primarily based path evaluation instrument named GhiHorn. On this submit I delve deeper into GhiHorn to debate the way it works and the way it may be used to unravel path evaluation issues.
Path Discovering With out Supply Code
As famous above with the debugger detection instance, we consider that many widespread challenges in malware evaluation and reverse engineering could be framed by way of discovering a path to a particular level in a program. On the binary degree we don’t have the additional info that you just sometimes have in supply code. For instance, as an alternative of programmer-named variables we should purpose about CPU registers or reminiscence values. Moreover, optimizations carried out throughout compilation might end in unique or convoluted code preparations that make it exhausting to acknowledge widespread program options.
Recall the instance from our earlier weblog submit on binary path discovering, which is proven in Determine 1. Discovering a path from line 1 to the goal at line 20 (marked as “Goal line!”) is comparatively simple by visible inspection: we all know that native integer variable x
should be set to 42
due to the situation at line 19, which might solely occur at line 12 when variable y
is the same as 2
, which in flip will depend on the three integer enter parameters, i
, j
, and ok
being set to 6
, 7
, and 8
, respectively.
Determine 1: Instance program to exhibit path discovering
Nevertheless, when this program is compiled, quite a few complexities emerge:
- Variable names and kinds are misplaced.
- Optimizations might end in convoluted code buildings.
- There are numerous meeting directions which are mixed to kind higher-level operations.
- Every CPU gives its personal instruction set structure (ISA) with particular calling conventions, registers, and operations.
From Pharos to Ghidra
We’ve been creating instruments to evaluate path viability in our Pharos Binary Evaluation Framework for a while. Pharos’ path discovering instruments encode program management movement graphs as SMT assertions which are solved by the Z3 theorem prover. The unique encoding scheme utilized in Pharos relies on mutually unique Z3 assertions which are generated utilizing Pharos’ evaluation. Representing a program utilizing this scheme turned out to be cumbersome for a number of causes:
- The notation was not designed for human legibility, e.g., knowledge and management movement buildings and the transitions between them have been made utilizing normal Z3 assertions, which made encoding management movement unnatural and exhausting for analysts to know.
- Widespread program phenomena have been difficult to mannequin, corresponding to calling capabilities, recursion, and complicated knowledge varieties.
- State variables grew to become derived symbolic values generated by Pharos that could possibly be exhausting to map again to significant program buildings.
Our different Pharos-based path evaluation instrument is ApiAnalyzer. ApiAnalyzer traverses program management movement graphs on the lookout for sequences of Utility Program Interface (API) perform calls that match prescribed behavioral signatures. The issue of figuring out a path in a program that satisfies particular constraints, corresponding to traversing a sequence of API perform calls, lends itself properly to path discovering. Our new work thus seeks to reframe this API evaluation by way of path discovering.
The Pharos method to program evaluation was initially designed to be easy, quick, and helpful for malware analysts. Many of those objectives have been achieved on the expense of deeper analyses. It seems that superior path evaluation requires further constancy that Pharos had issue offering.
Pharos can reply sure questions rapidly, for instance: Is the worth used at X the identical as the worth used at Y? Since Pharos focuses on velocity, nevertheless, it trades off how advanced paths can grow to be earlier than accuracy is now not assured. In consequence, reasoning about advanced program paths in Pharos is difficult. As famous above, we have been already exploring extra rigorous path evaluation utilizing SMT solvers in Pharos when Ghidra was launched. The Ghidra decompiler, with its wealthy programming API, opened new and thrilling methods to method path evaluation issues.
GhiHorn
We’ve created a brand new Ghidra-based instrument named GhiHorn (See Determine 2 under) to benefit from our current advances and to supply an extensible framework for binary-path evaluation issues. GhiHorn helps reverse engineers and malware analysts reply fascinating questions corresponding to
- Does a path exist to a specified level in a program (i.e. feasibility)?
- If there’s a path, what values must be assigned to program variables to succeed in it?
- If there may be not a possible path, why?
- Does the trail point out an fascinating or indicative conduct?
Determine 2: Ghihorn Person Interface
We have named this new Kaiju instrument “GhiHorn” (GHI-dra HORN-ifier), consistent with the custom of comparable source-code evaluation instruments utilizing Horn clauses, together with SeaHorn (a C language hornifier) and JayHorn (a Java hornifier). GhiHorn is created within the spirit of those different verification instruments, however it operates on Ghidra-generated knowledge buildings, particularly p-code. Ghidra’s decompiler and p-code language present strong details about program semantics for various architectures. In line with Ghidra’s p-code documentation:
A p-code operation is the analog of a machine instruction. All p-code operations have the identical primary format internally. All of them take a number of varnodes as enter and optionally produce a single output varnode.
Varnodes in Ghidra are knowledge components represented as triples (reminiscence area, offset, and measurement) on which p-code operates. P-code and varnodes are vital to Ghidra’s decompilation course of. Ghidra generates each p-code and varnodes for every instruction in a program throughout preliminary program evaluation and disassembly. The p-code and varnodes initially generated are uncooked within the sense that they’re solely meant to symbolize the instruction semantics with little or no high-level info gleaned from greater order evaluation.
Throughout decompilation, pcode and varnodes are refined and related to summary native variables and source-code degree knowledge buildings. We time period this “excessive p-code” as a result of it’s certain to knowledge buildings in Ghidra that embrace decompilation info, corresponding to HighVariables
and HighFunctions
. Happily, the construction of excessive p-code lends itself to SMT-based encoding.
With Ghidra offering the information and management movement buildings essential to symbolize a path, we want a solution to encode this system buildings for an SMT solver. Enter Horn clauses, that are a particular encoding for verification situations that may be constructed routinely from program management movement buildings. Researchers have continued to make advances in Horn clause solvers, and lots of SMT solvers, together with Z3, now embrace Horn solvers, which makes Horn clause encoding a viable resolution for path evaluation issues. We delve into among the particulars of how GhiHorn encodes p-code under.
GhiHorn Encoding
GhiHorn encodes Ghidra p-code as SMT-Lib Horn clauses appropriate for the Z3 solver. Horn clauses are rule-like constraints expressed as implications. Horn clauses can symbolize transitions by a management movement graph of the logical kind:
𝐼𝑛𝑝𝑢𝑡 𝐿𝑜𝑐𝑎𝑡𝑖𝑜𝑛 ∧𝐶𝑜𝑛𝑠𝑡𝑟𝑎𝑖𝑛𝑡𝑠⇒𝑂𝑢𝑡𝑝𝑢𝑡 𝐿𝑜𝑐𝑎𝑡𝑖𝑜𝑛Input Location ∧Constraints⇒Output Location
the place an enter location conjoined with constraints on state variables transitions to an output location. In program phrases, the enter location is the originating primary block, the constraints are situations over program variables, and the output location is a succeeding primary block. An instance of a conditional expression and the related Horn rule generated for it are proven in Desk 1 under. When management movement arrives at Line 1 and the variable x
is 42
, then management movement might progress to Line 2. As a result of this can be a resolution level within the code, a second rule is required to transition from Line 1 to Line 3. Taken collectively, these guidelines mannequin the conditional construction proven within the following supply code.
Desk 1: Instance of a conditional expression and the related Horn rule generated for it
Ghidra gives a management movement graph from which the enter and output elements of the principles could be derived. On condition that p-code represents machine directions, you should use them to symbolize state transitions inside blocks. Translating p-code statements into Z3 expressions then seems to be comparatively straight ahead. For instance, here’s a advanced supply code assertion (proven in Desk 2) that was generated by Ghidra’s decompiler: ((param_1 >> 2) + 1U & 0xff) == 0x55. This instance is used as a result of it decomposes to a number of p-code operations, and these could be mapped to Z3 expressions. Be aware that the supply code operations, corresponding to >> have analogs in each p-code (INT_SRIGHT
) and in Z3 SMT (bvashr
). For essentially the most half variables are represented as 64-bit bit vectors.
Be aware that the varnodes current in p-code operations are changed with variables within the last Z3 expression (e.g., param_1
). That is attainable as a result of, throughout decomposition, Ghidra assocates varnodes with higher-level decompiler components, corresponding to variables. Working on significant knowledge components corresponding to variables gives for far more fascinating outcomes, that are mentioned under.
Desk 2: Decompilation, p-code, and Z3 expressions.
Because the desk above illustrates, the decompilation was generated by Ghidra. The p-code operations embrace varnodes represented as triples: (reminiscence area, offset, measurement). The Z3 expressions are all equalities of the shape output = operations enter.
After every primary block is hornified it’s organized right into a set of Z3 guidelines. Every rule is an implication the place the antecedent is the supply primary block and the consequence is the sink block. A whole instance primarily based on the Ghidra decompilation proven in Desk 2 is proven in Determine 3. The rule captures the transition from the fundamental block at deal with 0x100003f60 to the fundamental block at deal with 0x100003fa2. The constraints seize situations on param_1 taken from the p-code operations that should be true to allow the transition.
Determine 3: Horn rule generated by GhiHorn
A group of those guidelines is generated to symbolize a management movement graph for a program. Be aware that variables are handed as state (i.e., arguments) to each the enter and output blocks. On this approach, program state is up to date and maintained. In the end the encoded program is handed to Z3, and GhiHorn queries for a state (i.e., the deal with current within the encoding) to find out if a viable path is current.
Though this method to reasoning about program paths is intuitive it requires further buildings to mannequin actual applications. For simplicity’s sake, all reminiscence operations are carried out on a single giant reminiscence array (named Reminiscence
in Determine 3) that’s managed by Z3. This method retains issues easy however could be restricted, and it’s under no circumstances performant. In future variations of GhiHorn we plan to enhance reminiscence modeling by higher dealing with pointers and higher illustration of various reminiscence areas.
One other downside is how you can deal with exterior dependencies, corresponding to imported API capabilities for which the code will not be obtainable. GhiHorn gives a functionality to construct a simulated API ecosystem by compiling library information that include easy implementations of widespread API capabilities. For instance, Determine 4 exhibits the supply code for the simulated API capabilities CreateFile()
and CloseHandle()
. On this instance, the implementation merely maintains an array meant to mannequin a file deal with desk, which makes it easy to trace which handles are open and closed, nothing extra.
Determine 4: Simulated API capabilities
Trying Forward
In a future submit I’ll current two path evaluation instruments that we’ve carried out on prime of the Ghihorn platform: Path Analyzer and API Analyzer.