Description
I'm analyzing parser behavior and need to determine which functions in an instrumented program accessed specific input bytes. This would help understand:
- Which parsing functions process which byte ranges
- How input bytes flow through the call stack
- Which code paths are triggered by specific input patterns
Environment
- PolyTracker version: 4.0.0 (Docker: trailofbits/polytracker:latest)
- Platform: macOS ARM64 (using --platform linux/amd64)
- Python version (in container): 3.10
What I've Successfully Extracted
I've made significant progress exploring the v4.0.0 API and can extract:
1. Function Names
from polytracker.taint_dag import TDFunctionsSection, TDStringSection
# Extract function ID → name mapping
for func_id, fn_header in enumerate(functions_section):
    func_name = string_section.read_string(fn_header.name_offset)
Results: Successfully extracted all function names: main, parse_expr, program, statement, expr, test, sum, term, etc.
2. Function Call Trace
from polytracker.taint_dag import TDEventsSection
# Iterate through execution trace
for event in events_section:
    function_id = event.fnidx  # Function being called
    event_type = event.kind  # ENTRY (0) or EXIT (1)
Results: Complete function call trace with proper nesting:
ENTER program
ENTER statement
ENTER expr
ENTER term
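The nesting shown above can be reproduced from entry and exit events alone by tracking call depth. The following is a minimal sketch, not PolyTracker code: it assumes the events have already been reduced to plain (function_index, kind) pairs, and the names list and sample events are hypothetical stand-ins for the functions and events sections.

```python
from typing import Iterable, List, Tuple

ENTRY, EXIT = 0, 1  # event kinds as described above: ENTRY (0), EXIT (1)


def render_call_trace(events: Iterable[Tuple[int, int]], names: List[str]) -> List[str]:
    """Turn (function_index, kind) pairs into indented ENTER lines."""
    lines: List[str] = []
    depth = 0
    for fnidx, kind in events:
        if kind == ENTRY:
            lines.append("  " * depth + "ENTER " + names[fnidx])
            depth += 1
        elif kind == EXIT:
            # Guard against an unmatched EXIT at the start of a trace.
            depth = max(depth - 1, 0)
    return lines


# Hypothetical data standing in for the functions and events sections.
names = ["program", "statement", "expr", "term"]
events = [
    (0, ENTRY), (1, ENTRY), (2, ENTRY), (3, ENTRY),
    (3, EXIT), (2, EXIT), (1, EXIT), (0, EXIT),
]
trace_lines = render_call_trace(events, names)
print("\n".join(trace_lines))
```

Only a depth counter is needed to print the nesting; a full stack of function names becomes necessary once something has to be attributed to the currently active function.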
3. Input Byte Offsets
# Extract byte offsets from taint forest
for node in trace.taint_forest.nodes():
    if node.source is not None:
        offset = trace.file_offset(node).offset  # Byte offset in input
        affected_cf = node.affected_control_flow  # Whether byte influenced branching
Results: All input bytes tracked with control flow information.
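Since the goal is to reason about byte ranges rather than single bytes, the extracted offsets can be collapsed into contiguous ranges. This is a small sketch in plain Python; the sample offsets are hypothetical, and collapse_ranges is an illustrative helper, not part of PolyTracker.

```python
from typing import Iterable, List, Tuple


def collapse_ranges(offsets: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse byte offsets into sorted, inclusive (start, end) ranges."""
    ranges: List[Tuple[int, int]] = []
    for off in sorted(set(offsets)):
        if ranges and off == ranges[-1][1] + 1:
            # Offset extends the current range by one byte.
            ranges[-1] = (ranges[-1][0], off)
        else:
            # Gap found (or first offset): start a new range.
            ranges.append((off, off))
    return ranges


# Hypothetical offsets, as they might be collected from the loop above.
byte_ranges = collapse_ranges([0, 1, 2, 3, 5, 6, 4, 9])
print(byte_ranges)
```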
What's Missing: The Correlation
I cannot find a way to correlate these pieces together to answer:
"Which function accessed byte X?"
For example, given:
- Input file: {i=1;}\n
- Target byte: the '=' character at offset 2
I need to determine: "The expr function accessed byte 2"
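To make the desired result concrete, here is a sketch of the attribution I have in mind. It assumes a single ordered stream that interleaves function entry/exit records with byte-access records. That stream is hypothetical: I have not found anything in the v4.0.0 API that provides it, and the record shapes below are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple, Union

# Hypothetical record shapes (not a PolyTracker format):
#   ("enter", function_name), ("exit", function_name), ("access", byte_offset)
Record = Tuple[str, Union[str, int]]


def bytes_by_function(stream: Iterable[Record]) -> Dict[str, Set[int]]:
    """Attribute each accessed byte offset to the innermost active function."""
    stack: List[str] = []
    accessed: Dict[str, Set[int]] = defaultdict(set)
    for kind, value in stream:
        if kind == "enter":
            stack.append(str(value))
        elif kind == "exit":
            if stack:
                stack.pop()
        elif kind == "access" and stack:
            # The function on top of the stack is the one touching the byte.
            accessed[stack[-1]].add(int(value))
    return dict(accessed)


# Invented stream for the input {i=1;}\n
stream: List[Record] = [
    ("enter", "program"), ("enter", "statement"), ("access", 0),
    ("enter", "expr"), ("access", 1), ("access", 2), ("exit", "expr"),
    ("access", 5), ("exit", "statement"), ("exit", "program"),
]
mapping = bytes_by_function(stream)
print(mapping)
```

With such a stream, answering "which function accessed byte 2?" is a lookup over mapping. The missing piece is a shared ordering between events and taint accesses.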
What I've Tried
Approach 1: TDEvent attributes
for event in events_section:
    # event has: fnidx, kind
    # event does NOT have: label, taints, bytes_accessed
    pass
Result: Events know which function, but not which bytes it accessed.
Approach 2: Taint nodes
for node in forest.nodes():
    # node has: label, source, affected_control_flow
    # node does NOT have: function, event, accessed_by
    pass
Result: Nodes know which byte, but not which function accessed it.
Approach 3: Documented methods
# These raise NotImplementedError:
trace.access_sequence() # NotImplementedError
trace.function_trace() # NotImplementedError
for event in trace:  # NotImplementedError (via __iter__)
    pass
Result: Documented API methods are not implemented in v4.0.0.
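A quick way to survey which documented methods are usable is to call each one and record whether it raises NotImplementedError. The sketch below runs against a stub class so that it is self-contained; StubTrace and its num_events method are hypothetical, and a real check would pass bound methods of the loaded trace object instead.

```python
from typing import Callable, Dict


def probe(calls: Dict[str, Callable[[], object]]) -> Dict[str, bool]:
    """Map each label to True if the call ran, False if it raised NotImplementedError."""
    results: Dict[str, bool] = {}
    for label, call in calls.items():
        try:
            call()
            results[label] = True
        except NotImplementedError:
            results[label] = False
    return results


class StubTrace:
    """Hypothetical stand-in that mimics the behavior reported above."""

    def access_sequence(self):
        raise NotImplementedError()

    def function_trace(self):
        raise NotImplementedError()

    def __iter__(self):
        raise NotImplementedError()

    def num_events(self) -> int:  # invented method, only to show a passing case
        return 8


stub = StubTrace()
report = probe({
    "access_sequence": stub.access_sequence,
    "function_trace": stub.function_trace,
    "__iter__": lambda: iter(stub),
    "num_events": stub.num_events,
})
print(report)
```

If a method is written as a generator, the error only surfaces on iteration, so wrapping the call in list(...) inside the lambda gives a more reliable probe.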
Approach 4: Control Flow Log
from polytracker.taint_dag import TDControlFlowLogSection
# CF log has function_id_mapping but unclear how to correlate with taints
Result: function_id_mapping turned out to be a method rather than an attribute, and calling it returns empty results.
Questions
- Is there an API I'm missing?
  Is there a method or property that links taint labels to the events or functions that accessed them?
- Should I use a different trace format?
  Issue #6534 ("Emitting and loading a DBProgramTrace instead of a TDProgramTrace") mentioned DBProgramTrace vs TDProgramTrace. Can I generate .db files where access_sequence() actually works?
- Is this data available internally but not exposed?
  If the correlation exists internally but isn't exposed via the Python API, would you accept a PR to add it?
- Alternative approach?
  Is there a recommended way to achieve this byte-to-function mapping with the current v4.0.0 API?
Minimal Reproduction
# Instrument a C program
docker run --rm --platform linux/amd64 \
-v $(pwd):/workdir -w /workdir \
trailofbits/polytracker bash -c \
"polytracker build clang program.c -o program && \
polytracker instrument-targets --taint --ftrace program"
# Execute with stdin tracking
docker run --rm --platform linux/amd64 \
-v $(pwd):/workdir -w /workdir \
-e POLYDB=polytracker.tdag \
-e POLYTRACKER_STDIN_SOURCE=1 \
trailofbits/polytracker \
bash -c "./program.instrumented < input.txt"
# Analyze the trace
docker run --rm --platform linux/amd64 \
-v $(pwd):/workdir -w /workdir \
trailofbits/polytracker python3 -c "
from polytracker import PolyTrackerTrace
from polytracker.taint_dag import TDFunctionsSection, TDEventsSection, TDStringSection
trace = PolyTrackerTrace.load('polytracker.tdag')
# Can extract functions and events separately,
# but cannot correlate which functions accessed which bytes
"
Use Case
This mapping would enable:
- Parser debugging: Identify which function mishandled a specific byte
- Security analysis: Find which code paths process attacker-controlled bytes
- Execution visualization: Create diagrams showing byte flow through functions
- Performance analysis: Identify hot paths for specific input patterns
Related Issues:
- #6534: "Emitting and loading a DBProgramTrace instead of a TDProgramTrace" (similar access_sequence question)