[Question] How to map input bytes to the functions that accessed them in v4.0.0? #6581

@rayansiddique9

I'm analyzing parser behavior and need to determine which functions in an instrumented program accessed specific input bytes. This would help me understand:

  • Which parsing functions process which byte ranges
  • How input bytes flow through the call stack
  • Which code paths are triggered by specific input patterns

Environment

  • PolyTracker version: 4.0.0 (Docker: trailofbits/polytracker:latest)
  • Platform: macOS ARM64 (using --platform linux/amd64)
  • Python version (in container): 3.10

What I've Successfully Extracted

I've made significant progress exploring the v4.0.0 API and can extract:

1. Function Names

from polytracker.taint_dag import TDFunctionsSection, TDStringSection

# Extract function ID → name mapping
for func_id, fn_header in enumerate(functions_section):
    func_name = string_section.read_string(fn_header.name_offset)

Results: Successfully extracted all function names: main, parse_expr, program, statement, expr, test, sum, term, etc.

2. Function Call Trace

from polytracker.taint_dag import TDEventsSection

# Iterate through execution trace
for event in events_section:
    function_id = event.fnidx      # Function being called
    event_type = event.kind        # ENTRY (0) or EXIT (1)

Results: Complete function call trace with proper nesting:

ENTER program
  ENTER statement
    ENTER expr
      ENTER term

3. Input Byte Offsets

# Extract byte offsets from taint forest
for node in trace.taint_forest.nodes():
    if node.source is not None:
        offset = trace.file_offset(node).offset  # Byte offset in input
        affected_cf = node.affected_control_flow # Whether byte influenced branching

Results: All input bytes tracked with control flow information.

What's Missing: The Correlation

I cannot find a way to correlate these pieces together to answer:
"Which function accessed byte X?"

For example, given:

  • Input file: {i=1;}\n
  • Target byte: the = character at offset 2

I need to determine: "The expr function accessed byte 2"
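To make the desired result concrete, here is a sketch of the correlation I am after. It assumes a single ordered stream that interleaves function entry/exit records with byte-read records. That merged stream is exactly what I cannot obtain from v4.0.0, so the record format, the sample `records`, and the `bytes_to_functions` helper are all hypothetical. Only the `expr` reading offset 2 entry reflects the example above; the other reads are made up for illustration.

```python
from collections import defaultdict


def bytes_to_functions(records):
    """Map each byte offset to the functions that were active when it was read.

    `records` is a hypothetical ordered stream of (kind, value) pairs where
    kind is "enter"/"exit" (value = function name) or "read" (value = offset).
    """
    stack = []                      # current call stack, innermost last
    accessed_by = defaultdict(list)
    for kind, value in records:
        if kind == "enter":
            stack.append(value)
        elif kind == "exit":
            if stack:
                stack.pop()
        elif kind == "read" and stack:
            fn = stack[-1]          # attribute the read to the innermost function
            if fn not in accessed_by[value]:
                accessed_by[value].append(fn)
    return dict(accessed_by)


data = b"{i=1;}\n"
# Illustrative stream only; not produced by PolyTracker.
records = [
    ("enter", "program"), ("read", 0),
    ("enter", "statement"), ("read", 1),
    ("enter", "expr"), ("read", 2),
    ("exit", "expr"), ("exit", "statement"), ("exit", "program"),
]

mapping = bytes_to_functions(records)
print(chr(data[2]), mapping[2])
```

Attributing each read to the innermost active function is one design choice; recording the whole stack at read time would instead show how the byte flows through the callers.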

What I've Tried

Approach 1: TDEvent attributes

for event in events_section:
    # event has: fnidx, kind
    # event does NOT have: label, taints, bytes_accessed
    pass

Result: Events know which function, but not which bytes it accessed.

Approach 2: Taint nodes

for node in trace.taint_forest.nodes():
    # node has: label, source, affected_control_flow
    # node does NOT have: function, event, accessed_by
    pass

Result: Nodes know which byte, but not which function accessed it.

Approach 3: Documented methods

# These raise NotImplementedError:
trace.access_sequence()  # NotImplementedError
trace.function_trace()   # NotImplementedError
for event in trace:      # NotImplementedError (via __iter__)
    pass

Result: Documented API methods are not implemented in v4.0.0.
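For anyone who wants to check other methods the same way, the probing can be scripted. This is a generic sketch: `StubTrace` is a hypothetical stand-in that mimics the behavior reported above (it is not the real trace class), and `working_method` is a made-up name included only to show the success case. The same `probe` helper could be pointed at a loaded trace object, assuming the methods of interest take no arguments.

```python
class StubTrace:
    """Hypothetical stand-in mimicking the behavior reported above."""

    def access_sequence(self):
        raise NotImplementedError

    def function_trace(self):
        raise NotImplementedError

    def __iter__(self):
        raise NotImplementedError

    def working_method(self):  # made-up name, illustrates the success case
        return []


def probe(obj, method_names):
    """Call each zero-argument method and report whether it is implemented."""
    status = {}
    for name in method_names:
        try:
            result = getattr(obj, name)()
            # Generators defer their body until advanced, so pull one item.
            if hasattr(result, "__next__"):
                next(result, None)
            status[name] = "ok"
        except NotImplementedError:
            status[name] = "not implemented"
    return status


status = probe(
    StubTrace(),
    ["access_sequence", "function_trace", "__iter__", "working_method"],
)
for name, state in status.items():
    print(f"{name}: {state}")
```

Only `NotImplementedError` is caught here, so any other exception still surfaces with its full traceback.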

Approach 4: Control Flow Log

from polytracker.taint_dag import TDControlFlowLogSection

# CF log has function_id_mapping but unclear how to correlate with taints

Result: function_id_mapping turned out to be a method rather than an attribute, and calling it returns an empty result.

Questions

  1. Is there an API I'm missing?
    Is there a method/property that links taint labels to the events/functions that accessed them?

  2. Should I use a different trace format?
    Issue #6534 ("Emitting and loading a DBProgramTrace instead of a TDProgramTrace") discusses DBProgramTrace vs. TDProgramTrace. Can I generate .db files where access_sequence() actually works?

  3. Is this data available internally but not exposed?
    If the correlation exists internally but isn't exposed via Python API, would you accept a PR to add it?

  4. Alternative approach?
    Is there a recommended way to achieve this byte-to-function mapping with the current v4.0.0 API?

Minimal Reproduction

# Instrument a C program
docker run --rm --platform linux/amd64 \
    -v $(pwd):/workdir -w /workdir \
    trailofbits/polytracker bash -c \
    "polytracker build clang program.c -o program && \
     polytracker instrument-targets --taint --ftrace program"

# Execute with stdin tracking
docker run --rm --platform linux/amd64 \
    -v $(pwd):/workdir -w /workdir \
    -e POLYDB=polytracker.tdag \
    -e POLYTRACKER_STDIN_SOURCE=1 \
    trailofbits/polytracker \
    bash -c "./program.instrumented < input.txt"

# Analyze the trace
docker run --rm --platform linux/amd64 \
    -v $(pwd):/workdir -w /workdir \
    trailofbits/polytracker python3 -c "
from polytracker import PolyTrackerTrace
from polytracker.taint_dag import TDFunctionsSection, TDEventsSection, TDStringSection

trace = PolyTrackerTrace.load('polytracker.tdag')

# Can extract functions and events separately,
# but cannot correlate which functions accessed which bytes
"

Use Case

This mapping would enable:

  • Parser debugging: Identify which function mishandled a specific byte
  • Security analysis: Find which code paths process attacker-controlled bytes
  • Execution visualization: Create diagrams showing byte flow through functions
  • Performance analysis: Identify hot paths for specific input patterns
