DeepExtract is an IDA Pro 9.x plugin that bridges the gap between compiled PE binaries and AI coding agents. It extracts the full structural context of a binary (PE metadata, function signatures, disassembly, decompiled C/C++ code, cross-references, control flow, and more), and writes it in formats that agents like Claude Code, Codex, and Cursor can directly ingest and reason over. The goal is to let these agents read, navigate, and understand PE files the same way they understand source code repositories, enabling AI-assisted vulnerability research and reverse engineering at scale.
Results are written to a per-binary SQLite database, generated as C++ source files organized by module, and optionally as assembly (.asm) files, with rich per-function metadata headers designed for AI agent consumption. DeepExtract operates in two modes: headless for automated batch processing of large binary datasets, and interactive for targeted single-binary analysis within the IDA GUI.
For a deep dive into the motivation and design journey behind DeepExtract, from the initial experiments with raw decompiled C++ dumps through the challenges of making agents reason over disconnected binary data, to the structured extraction pipeline and agent runtime architecture, read the full writeup: Making Compiled Binaries Accessible to AI Coding Agents. The post covers why traditional agent code-navigation tools break on decompiler output, what led to the SQLite-backed structured approach, and how the DeepExtractRuntime extends coding agents with vulnerability research capabilities.
The extraction pipeline has three stages:
PE Binary (.exe/.dll/.sys)
|
v
IDA Pro 9.x + DeepExtract Plugin
(disassembly, decompilation, analysis)
|
v
Structured Output:
- SQLite database (per-binary)
- C++ source files (optional, --generate-cpp)
- Assembly files with AI metadata headers (optional, --generate-asm)
- JSON metadata (module_profile, function_index, file_info)
Stage 1 - Binary Loading. IDA Pro loads the PE file, performs auto-analysis, and builds its internal database (IDB). If PDB symbols are available, IDA resolves function names and type information.
Stage 2 - Extraction. DeepExtract iterates over every function in the IDB and extracts:
- Disassembly and Hex-Rays decompiled code (when available)
- Inbound and outbound cross-references
- Stack frame metrics, string literals, global variable accesses
- Dangerous API calls, loop structures, vtable analysis (experimental), indirect call targets
At the file level, it extracts PE headers, imports, exports, sections, security features, Rich header data, TLS callbacks, and .NET CLR metadata.
Stage 3 - Output. All extracted data is written to a SQLite database. When C++ generation is enabled (--generate-cpp), the plugin groups decompiled functions into source files organized by class and module, and generates JSON metadata files (function_index.json, module_profile.json, file_info.json) alongside a human-readable report (file_info.md). When assembly generation is enabled (--generate-asm), the plugin writes .asm files containing raw disassembly grouped by class membership and address order, with structured per-function headers (callers, callees, strings, dangerous APIs) for AI agent analysis. Library/CRT functions are separated into dedicated files. Assembly output is independent of decompiler availability.
- PE (Portable Executable): The binary format used by Windows for executables (.exe), dynamic libraries (.dll), and drivers (.sys). PE files contain code sections, import/export tables, resource data, and metadata headers.
- IDA Pro: A disassembler and reverse engineering platform. IDA loads a binary, identifies functions, resolves cross-references, and builds a navigable representation of the code. The analysis results are stored in an IDA database (.idb/.i64).
- Hex-Rays Decompiler: An IDA Pro component that converts disassembled machine code into a C-like pseudo-code representation. DeepExtract stores this output as decompiled_code for each function.
- Cross-references (xrefs): Records of which functions call which other functions. Inbound xrefs list all callers of a function; outbound xrefs list all functions it calls. These form the binary's call graph.
- Headless mode: Running IDA Pro from the command line without a GUI, using idat.exe/idat64.exe with the -A flag. DeepExtract detects this mode and runs the full extraction pipeline automatically on startup.
- Structured output: The SQLite database and optional C++ files produced by DeepExtract. The database contains three tables (file_info, functions, function_xrefs) and is designed for programmatic queries by SQL, Python, or AI agent frameworks.
Install the plugin:
hcli plugin install DeepExtract
Extract a single binary (headless):
"C:\Program Files\IDA Professional 9.2\idat.exe" -A -L"C:\output\log.txt" -S"main.py --sqlite-db C:\output\kernel32.db --generate-cpp" "C:\Windows\System32\kernel32.dll"
Batch-extract a directory:
.\headless_batch_extractor.ps1 -ExtractDirRecursive "C:\Windows\System32" -StorageDir "C:\Analysis"
Interactive mode: Open a binary in IDA Pro, then Edit > Plugins > DeepExtract (or Ctrl-Shift-E).
The extractor operates at two levels: file-level metadata and per-function analysis.
The plugin extracts 30+ metadata points per binary:
- Identification: MD5, SHA256, file size, extension
- PE headers: Sections, entry points, Rich header (linker toolchain data), TLS callbacks
- Version information: Product name, company name, copyright, original filename, PDB path
- Security features: ASLR, DEP/NX, CFG, SafeSEH status, DLL characteristics
- Runtime environment: .NET assembly detection, CLR metadata, delay-load DLL imports
For every identified function:
- Signatures: Base and extended function signatures, including demangled and mangled names
- Code: Full disassembly and Hex-Rays decompiled C/C++ output
- Cross-references: Inbound and outbound xrefs in full and simplified formats, plus a deduplicated relational table for SQL-based call graph queries
- Dangerous API detection: Matches outbound calls against 480+ security-critical APIs (e.g., CreateRemoteThread, LoadLibrary, CreateProcessW)
- String literals: Per-function string references
- Global variable accesses: Read/write references to global data
- Stack frame analysis: Aggregate frame sizes (locals, arguments, saved registers), frame pointer and exception handler flags, and stack canary detection via multi-heuristic analysis (variable names, security cookie calls, XOR patterns)
- Loop analysis: Natural loop detection via dominator-based back edges with SCC fallback for irreducible control flow, per-loop cyclomatic complexity, and infinite loop heuristic (zero exit edges)
- VTable analysis (experimental): Virtual call resolution for [reg+offset] call patterns, vtable slot inspection, and per-class method grouping from demangled names. Limited to single vtable at object offset 0; no multiple/virtual inheritance, thunk handling, or RTTI-based class hierarchy inference
- Indirect call resolution (experimental): Backward pattern matching to resolve common indirect call and jump patterns (register loads, memory dereferences, function pointer arrays), jump table detection via IDA's switch analysis with manual fallback, and basic obfuscation handling (XOR/ADD/SUB transforms). Coverage is limited to intra-procedural heuristics within a short instruction window
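The dominator-based loop detection mentioned in the list above can be illustrated with a small, self-contained sketch. It is not the plugin's implementation: the CFG is a plain dictionary of hypothetical basic-block names, every block is assumed reachable from the entry, and the SCC fallback for irreducible control flow is left out. An edge `tail -> head` is treated as a back edge when `head` dominates `tail`, and a loop with zero exit edges is flagged by the infinite-loop heuristic.

```python
from collections import defaultdict


def compute_dominators(cfg, entry):
    """Iteratively solve dom[n] = {n} | intersection of dom[p] over predecessors p."""
    nodes = set(cfg) | {s for succs in cfg.values() for s in succs}
    preds = defaultdict(set)
    for node, succs in cfg.items():
        for succ in succs:
            preds[succ].add(node)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            pred_doms = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*pred_doms) if pred_doms else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom, preds


def natural_loops(cfg, entry):
    """Return one record per back edge: loop head, body blocks, and exit edges."""
    dom, preds = compute_dominators(cfg, entry)
    loops = []
    for tail, succs in cfg.items():
        for head in succs:
            if head not in dom[tail]:
                continue  # not a back edge
            # Collect the loop body by walking predecessors back from the tail.
            body = {head, tail}
            stack = [tail] if tail != head else []
            while stack:
                node = stack.pop()
                for p in preds[node]:
                    if p not in body:
                        body.add(p)
                        stack.append(p)
            exits = sorted((n, s) for n in body for s in cfg.get(n, ()) if s not in body)
            loops.append({"head": head, "body": body, "exit_edges": exits,
                          "possibly_infinite": not exits})
    return loops


# Hypothetical CFGs: a while-style loop (B <-> C) and a self-loop with no exit.
while_loop = natural_loops({"A": ["B"], "B": ["C", "D"], "C": ["B"], "D": []}, "A")
spin_loop = natural_loops({"A": ["B"], "B": ["B"]}, "A")
```

In the first CFG the only back edge is `C -> B`, so the loop body is `{B, C}` with a single exit edge `B -> D`; the second CFG has no exit edge and is flagged as possibly infinite.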
Each binary produces a SQLite database containing three tables.
file_info stores binary-level metadata:
- file_path, file_name, file_extension, file_size_bytes
- md5_hash, sha256_hash
- imports, exports, entry_point
- file_version, product_version, company_name, pdb_path
- rich_header, tls_callbacks, is_net_assembly, clr_metadata
- dll_characteristics, security_features, exception_info
functions stores per-function analysis data:
- function_signature, mangled_name, function_name
- assembly_code, decompiled_code
- inbound_xrefs, outbound_xrefs (full and simple JSON)
- vtable_contexts, global_var_accesses, dangerous_api_calls
- string_literals, stack_frame, loop_analysis
- analysis_errors, created_at
function_xrefs stores deduplicated cross-references for SQL-based call graph queries:
- source_id, target_id (foreign keys into functions)
- target_name, target_module
- function_type (generic, library, API, vtable, etc.)
- xref_type, direction (inbound/outbound)
- Unique constraint on (source_id, target_id, target_name, target_module, xref_type, direction)
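As an illustration of how these tables can be queried from Python, the sketch below builds a tiny in-memory database with a subset of the documented columns and runs two lookups. Several details are assumptions rather than facts from the schema: the primary key name `function_id`, the storage of `dangerous_api_calls` as a JSON array, the `xref_type` value `call`, and the reading of an `outbound` row as "source calls target". Check the Data Format Reference before running similar queries against a real extraction database.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE functions (
    function_id INTEGER PRIMARY KEY,  -- assumed key name
    function_name TEXT,
    dangerous_api_calls TEXT          -- assumed to hold a JSON array
);
CREATE TABLE function_xrefs (
    source_id INTEGER, target_id INTEGER,
    target_name TEXT, target_module TEXT,
    function_type TEXT, xref_type TEXT, direction TEXT
);
""")
# Sample rows are made up for the demonstration.
conn.executemany("INSERT INTO functions VALUES (?, ?, ?)", [
    (1, "WinMain", json.dumps([])),
    (7, "sub_401000", json.dumps(["connect", "WSAStartup"])),
    (12, "sub_4015A0", json.dumps([])),
    (45, "sub_402300", json.dumps([])),
])
conn.executemany("INSERT INTO function_xrefs VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1, 7, "sub_401000", None, "generic", "call", "outbound"),
    (45, 7, "sub_401000", None, "generic", "call", "outbound"),
    (7, 12, "sub_4015A0", None, "generic", "call", "outbound"),
    (7, None, "WSAStartup", "WS2_32.dll", "API", "call", "outbound"),
])


def callers_of(conn, target_name):
    """Names of functions that have an outbound xref to target_name."""
    rows = conn.execute(
        """
        SELECT DISTINCT src.function_name
        FROM function_xrefs AS x
        JOIN functions AS src ON src.function_id = x.source_id
        WHERE x.target_name = ? AND x.direction = 'outbound'
        ORDER BY src.function_name
        """,
        (target_name,),
    ).fetchall()
    return [row[0] for row in rows]


def functions_with_dangerous_apis(conn):
    """Map function name -> list of dangerous APIs, skipping empty lists."""
    result = {}
    query = "SELECT function_name, dangerous_api_calls FROM functions"
    for name, raw in conn.execute(query):
        apis = json.loads(raw) if raw else []
        if apis:
            result[name] = apis
    return result


callers = callers_of(conn, "sub_401000")
flagged = functions_with_dangerous_apis(conn)
```

Against a real database, replace `sqlite3.connect(":memory:")` and the sample inserts with a connection to the extracted `.db` file.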
When --generate-cpp is enabled, the plugin writes decompiled functions as grouped C++ source files:
- Class methods: Grouped by class into files of approximately 450-500 lines, named {module}_{class}_group_N.cpp. Methods are ordered alphabetically; each is preceded by a comment block with its name and signature.
- Standalone functions: Grouped into files of approximately 450-500 lines, named {module}_standalone_group_N.cpp, with the same ordering and comment conventions.
- function_index.json: Maps every function name to its .cpp file and library tag (WIL, STL, WRL, CRT, ETW/TraceLogging, or null for application code). See the Function Index Format Reference.
- module_profile.json: Pre-computed module fingerprint covering identity, scale, library composition, API surface, complexity metrics, and security posture. See the Module Profile Format Reference.
- file_info.md / file_info.json: Human-readable and machine-readable analysis reports. See the Analysis Metadata and Reports Reference.
Only functions with real Hex-Rays pseudocode are included in C++ output. Functions where decompilation failed (license unavailable, timeout, empty output, or decompiler returned None) are excluded from .cpp files and correctly reported as failed in module_profile.json and function_index.json.
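The grouping and naming rules above can be approximated with a greedy chunker. This is a sketch under assumptions, not the generator's actual logic: the 500-line budget, the `(name, code)` input shape, and the rule of closing a file before it would exceed the budget are all illustrative. Functions without pseudocode are skipped, mirroring the exclusion described above.

```python
def group_functions(module, functions, class_name=None, max_lines=500):
    """Greedily pack (name, code) pairs into {file_name: [function names]}.

    Functions are ordered alphabetically; a new file is started when adding
    the next function would push the current file past max_lines.
    """
    label = class_name if class_name else "standalone"
    groups = {}
    current, current_lines, index = [], 0, 1
    for name, code in sorted(functions, key=lambda item: item[0]):
        if not code:
            continue  # decompilation failed: excluded from C++ output
        line_count = code.count("\n") + 1
        if current and current_lines + line_count > max_lines:
            groups[f"{module}_{label}_group_{index}.cpp"] = current
            current, current_lines, index = [], 0, index + 1
        current.append(name)
        current_lines += line_count
    if current:
        groups[f"{module}_{label}_group_{index}.cpp"] = current
    return groups


def fake_code(lines):
    """Build a placeholder function body with the given number of lines."""
    return "\n".join(["// placeholder"] * lines)


# Hypothetical functions: two 300-line bodies, one 100-line body, one failure.
sample = [("b", fake_code(300)), ("a", fake_code(300)),
          ("c", fake_code(100)), ("failed", None)]
grouped = group_functions("kernel32", sample)
```

Here `a` fills the first file alone because adding `b` would exceed the budget, while `b` and `c` share the second file.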
When --generate-asm is enabled, the plugin writes raw disassembly into grouped .asm files. This output is independent of the Hex-Rays decompiler; it works with any IDA license that supports disassembly. Assembly files are written alongside C++ files in the same extracted_code/<module>/ directory by default, or to a custom directory via --asm-output-dir.
- Class methods: Grouped by class name (sorted by address within each class), named {module}_{class}_group_N.asm.
- Standalone functions: Sorted by address order and split at approximately 2000-2500 lines per file, named {module}_standalone_group_N.asm.
- Library/CRT functions: Separated into dedicated {module}_library.asm file(s) to avoid polluting application code analysis.
Each function is preceded by a structured metadata header:
; ============================================================
; Function: sub_401000
; Address: 0x401000
; Signature: int __cdecl sub_401000(int a1, int a2)
; Callers: WinMain (id:1), sub_402300 (id:45)
; Callees: WSAStartup [WS2_32.dll], socket [WS2_32.dll], sub_4015A0 (id:12)
; Strings: "Failed to connect", "192.168.1.1"
; Dangerous APIs: connect, WSAStartup
; ============================================================
The headers include cross-references (callers/callees with module attribution), string literals, dangerous API calls, and library tags, enabling AI agents to navigate the assembly without needing to query the database. When both C++ and ASM generation are enabled, function_index.json is updated with asm_files entries alongside the existing files (C++) entries.
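A tool that consumes the .asm files can recover these header fields with a few regular expressions. The sketch below parses the sample header shown above. It assumes one `; Key: value` field per line, comma-separated references, and double-quoted strings without embedded quotes; names that contain `, ` (for example some C++ signatures) would need a more careful tokenizer.

```python
import re

FIELD = re.compile(r"^;\s*([A-Za-z ]+?):\s*(.*)$")
REF = re.compile(r"^(?P<name>.+?)(?:\s+\[(?P<module>[^\]]+)\])?(?:\s+\(id:(?P<id>\d+)\))?$")


def parse_asm_header(lines):
    """Collect '; Key: value' fields from a header block into a dict."""
    fields = {}
    for line in lines:
        match = FIELD.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def parse_refs(value):
    """Split a Callers/Callees value into name, module, and id parts."""
    refs = []
    for entry in value.split(", "):
        match = REF.match(entry.strip())
        if match:
            ref_id = match.group("id")
            refs.append({"name": match.group("name"),
                         "module": match.group("module"),
                         "id": int(ref_id) if ref_id else None})
    return refs


def parse_strings(value):
    """Extract double-quoted string literals from a Strings value."""
    return re.findall(r'"([^"]*)"', value)


SAMPLE = """\
; ============================================================
; Function: sub_401000
; Address: 0x401000
; Signature: int __cdecl sub_401000(int a1, int a2)
; Callers: WinMain (id:1), sub_402300 (id:45)
; Callees: WSAStartup [WS2_32.dll], socket [WS2_32.dll], sub_4015A0 (id:12)
; Strings: "Failed to connect", "192.168.1.1"
; Dangerous APIs: connect, WSAStartup
; ============================================================"""

header = parse_asm_header(SAMPLE.splitlines())
callees = parse_refs(header["Callees"])
callers = parse_refs(header["Callers"])
strings = parse_strings(header["Strings"])
```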
DeepExtract supports two deployment methods:
- Plugin deployment: Install into the IDA plugins directory. Once installed, the plugin is available in the GUI via Edit > Plugins > DeepExtract (or Ctrl-Shift-E) for interactive single-binary analysis, and via the command line for headless batch processing.
- Standalone execution: Clone the repository and run headless extraction directly from the source directory. See Headless Batch Extraction and Headless Mode (Individual File) below.
To install as a plugin:
hcli plugin install DeepExtract
When a binary is open in IDA Pro, the plugin runs within the GUI and is accessible via:
- Menu: Edit > Plugins > DeepExtract
- Hotkey: Ctrl-Shift-E
This mode is designed for targeted analysis of a single binary. It presents a configuration dialog for:
- Output paths: SQLite database path and C++ output directory
- Feature selection: Dangerous APIs, strings, loops, stack frames
- PE metadata: Metadata extraction, Advanced PE, runtime info
- Analysis parameters: Thunk resolution depth and call validation confidence threshold
- Progress monitoring: Status indicator for the analysis pipeline
The interactive mode captures the current state of the researcher's IDA session, including renamed variables, custom comments, and manual type definitions stored in the .idb/.i64.
The headless batch extractor processes PE binaries at scale without IDA's GUI. It accepts directories, file lists, or running process IDs as input, spawns concurrent IDA instances, and writes structured output (SQLite databases, C++ source files, JSON metadata) to a storage directory. The output is organized per-module and ready for analysis through AI agents such as Cursor or Claude Code via DeepExtractRuntime.
Resource expectations: Batch extraction of large binary sets can run for several days and produce tens to hundreds of gigabytes of output depending on the number and size of modules. Plan disk space and machine availability accordingly.
Typical applications:
- Process context capture: Extract all modules loaded by a running process (-TargetPid) to reconstruct the full execution context of a target application, service, or malware sample.
- OS internals analysis: Extract C:\Windows\System32 to build a queryable, decompiled representation of Windows usermode libraries for understanding OS functionality, API behavior, and inter-component dependencies.
- Targeted binary auditing: Point the extractor at a specific set of binaries to produce structured data for vulnerability research, threat hunting, or code review workflows.
Clone the repository:
git clone https://github.com/marcosd4h/DeepExtractIDA.git
cd DeepExtractIDA
The script requires IDA Pro 9.x installed on the system. It auto-detects the IDA installation path; no additional configuration is needed. Run the extractor directly from the cloned directory:
.\headless_batch_extractor.ps1 -ExtractDirRecursive "C:\Windows\System32" -StorageDir "C:\funvr\system32_internals"
The script locates IDA, iterates over all PE files in the target directory, downloads PDB symbols (enabled by default), and launches concurrent IDA processes to extract each binary. Results are written to StorageDir, organized by module, with each module producing a SQLite database, C++ source files, and JSON metadata.
The script supports three input modes, which can be combined in a single invocation:
- Directory scan: -ExtractDirRecursive for recursive scanning, -ExtractDir for top-level only. Both accept comma-separated lists and can be used together.
- File list: -FilesToAnalyze accepts a text file with one path per line.
- PID mode: -TargetPid extracts all modules loaded by one or more running processes (comma-separated PIDs).
Files from all sources are merged into one batch and deduplicated. C++ code generation is enabled by default in batch mode; disable with -NoGenerateCpp. Assembly generation is disabled by default; enable with -GenerateAsm.
Additional batch parameters:
| Flag | Description |
|---|---|
| -MaxConcurrentProcesses | Number of parallel IDA processes (default: 4) |
| -StorageDir | Output directory for all analysis results (required) |
| -IdaPath | Path to IDA executable (auto-detected if omitted) |
<StorageDir>/
├── AGENTS.md # AI agent runtime bootstrap (Cursor/Codex)
├── CLAUDE.md # AI agent runtime bootstrap (Claude Code)
├── analyzed_modules_list.txt # List of files analyzed (all modes)
├── extraction_report.json # Summary report with success/failure stats
├── analyzed_files.db # Master tracking database
├── extracted_dbs/
│   └── <filename>_<hash>.db # Individual analysis databases (one per file)
├── extracted_code/
│   └── <module_name>/ # Per-module output directory
│       ├── *.cpp # Generated C++ code (unless -NoGenerateCpp)
│       ├── function_index.json # Function-to-file lookup index
│       ├── module_profile.json # Pre-computed module fingerprint
│       ├── file_info.json # Structured analysis metadata
│       ├── file_info.md # Human-readable analysis report
│       ├── *_standalone_group_N.asm # Assembly output (when -GenerateAsm)
│       ├── *_<class>_group_N.asm # Class method assembly groups
│       └── *_library.asm # Library/CRT assembly (separated)
├── logs/
│   ├── batch_extractor_<timestamp>.log # PowerShell batch execution log
│   ├── <filename>_<hash>_<timestamp>.log # IDA analysis logs
│   ├── symchk_<filename>_<timestamp>.log # Symbol download logs (if enabled)
│   └── symchk_<filename>_<timestamp>.log.err # Symbol download error logs (if enabled)
└── idb_cache/
    └── <filename>_<hash>.i64 # IDA database files
The extraction_report.json contains: extraction timestamp and mode, summary statistics (total, successful, failed), list of successfully extracted files with paths, and list of failed extractions with error details.
The script searches for IDA Pro 9.x installations in standard paths:
C:\Program Files\IDA Professional 9.x\
C:\Program Files\IDA Pro 9.x\
C:\Program Files (x86)\IDA Professional 9.x\
C:\Program Files (x86)\IDA Pro 9.x\
The latest version is selected. Override with the -IdaPath parameter.
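The "latest version wins" rule can be sketched in Python as follows. The batch script itself is PowerShell and its detection code is not reproduced here, so the directory-name pattern and the numeric `(major, minor)` comparison are assumptions made for illustration.

```python
import re
from pathlib import Path

VERSION = re.compile(r"IDA (?:Professional|Pro) (\d+)\.(\d+)")


def pick_latest_ida(install_dirs):
    """Return the 9.x install directory with the highest version, or None."""
    best = None
    for directory in install_dirs:
        match = VERSION.search(str(directory))
        if not match:
            continue
        version = (int(match.group(1)), int(match.group(2)))
        if version[0] != 9:
            continue  # only IDA 9.x is supported
        if best is None or version > best[0]:
            best = (version, directory)
    return best[1] if best else None


def find_ida_installs(roots=(r"C:\Program Files", r"C:\Program Files (x86)")):
    """List candidate install directories under the standard roots."""
    found = []
    for root in roots:
        base = Path(root)
        if base.is_dir():
            for pattern in ("IDA Professional 9.*", "IDA Pro 9.*"):
                found.extend(base.glob(pattern))
    return found


candidates = [
    r"C:\Program Files\IDA Professional 9.0",
    r"C:\Program Files\IDA Pro 9.2",
    r"C:\Program Files\IDA Professional 9.1",
]
latest = pick_latest_ida(candidates)
```

Comparing versions as integer tuples rather than strings keeps a hypothetical 9.10 ahead of 9.9.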
The script downloads PDB debug symbols from Microsoft's public symbol server before IDA analysis. This is enabled by default and allows IDA to resolve function names and type information for richer output.
- Enabled by default; disable with -NoDownloadSymbols
- Downloads symbols only for the files being analyzed
- Runs up to 10 parallel symchk.exe processes
- Stores symbols in a local cache (default: C:\symbols), reused across runs
- Sets _NT_SYMBOL_PATH at user level (no admin required)
Requires symchk.exe from the Windows SDK Debugging Tools. The script auto-detects it from standard Windows SDK paths, or accepts a manual path via -SymchkPath.
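For reference, a symbol path that combines a local cache with a symbol server conventionally has the form `srv*<cache>*<server>`. The sketch below composes that string and a plausible `symchk.exe` argument list using the defaults named above. The exact value the script writes to `_NT_SYMBOL_PATH` and the arguments it passes to `symchk.exe` are not shown in this document, so treat both as assumptions.

```python
MS_SYMBOL_SERVER = "https://msdl.microsoft.com/download/symbols"


def build_symbol_path(cache_dir=r"C:\symbols", server_url=MS_SYMBOL_SERVER):
    """Compose a srv*<cache>*<server> symbol path string."""
    return f"srv*{cache_dir}*{server_url}"


def build_symchk_args(binary_path, symbol_path, symchk="symchk.exe"):
    """Argument list that asks symchk to fetch symbols for one file (assumed usage)."""
    return [symchk, binary_path, "/s", symbol_path]


symbol_path = build_symbol_path()
symchk_args = build_symchk_args(r"C:\Windows\System32\kernel32.dll", symbol_path)
```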
| Flag | Description |
|---|---|
| -NoDownloadSymbols | Skip automatic PDB downloading (enabled by default) |
| -SymbolStorePath | Local symbol cache directory (default: C:\symbols) |
| -SymchkPath | Path to symchk.exe (auto-detected from Windows SDK) |
| -SymbolServerUrl | Symbol server URL (default: Microsoft public server) |
| Flag | Description |
|---|---|
| -NoExtractDangerousApis | Skip dangerous API detection (480+ APIs) |
| -NoExtractStrings | Skip string literal extraction |
| -NoExtractStackFrame | Skip stack frame analysis |
| -NoExtractGlobals | Skip global variable tracking |
| -NoAnalyzeLoops | Skip loop analysis |
| -NoPeInfo | Skip PE version information extraction |
| -NoPeMetadata | Skip PE metadata extraction |
| -NoAdvancedPe | Skip Rich header and TLS callback analysis |
| -NoRuntimeInfo | Skip .NET and delay-load DLL analysis |
| -ForceReanalyze | Force re-analysis even if already processed |
| -NoGenerateCpp | Skip C++ code generation for AI review |
| -GenerateAsm | Enable assembly (.asm) file generation (disabled by default) |
Directory scan (recursive):
.\headless_batch_extractor.ps1 -ExtractDirRecursive "C:\Windows\System32" -StorageDir "C:\funvr\system32_internals"
Mixed recursive and non-recursive:
.\headless_batch_extractor.ps1 `
-ExtractDirRecursive "C:\Windows\System32","C:\Windows\SystemApps" `
-ExtractDir "C:\Windows" `
-StorageDir "C:\funvr\more_windows_internals"
File list mode:
.\headless_batch_extractor.ps1 -FilesToAnalyze "targets.txt" -StorageDir "C:\funvr\vr_campaign1"
Where targets.txt contains:
C:\Windows\System32\kernel32.dll
C:\Windows\System32\ntdll.dll
C:\Program Files\MyApp\app.exe
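A list file like this can be generated rather than written by hand. The following sketch collects files by extension and writes one path per line. The extension filter and the UTF-8 encoding are assumptions; the document does not state how the script validates PE files or which encoding it expects for the list.

```python
from pathlib import Path

PE_EXTENSIONS = {".exe", ".dll", ".sys"}


def collect_pe_paths(root, recursive=True):
    """Return sorted paths of files under root whose extension looks like a PE."""
    base = Path(root)
    walker = base.rglob("*") if recursive else base.glob("*")
    return sorted(
        str(path) for path in walker
        if path.is_file() and path.suffix.lower() in PE_EXTENSIONS
    )


def write_target_list(paths, list_file):
    """Write one path per line, the layout -FilesToAnalyze expects."""
    Path(list_file).write_text("\n".join(paths) + "\n", encoding="utf-8")
```

Usage: `write_target_list(collect_pe_paths(r"C:\Program Files\MyApp"), "targets.txt")`, then pass `targets.txt` to `-FilesToAnalyze`.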
PID mode (single and multiple processes):
.\headless_batch_extractor.ps1 -TargetPid 1234 -StorageDir "C:\Analysis"
.\headless_batch_extractor.ps1 -TargetPid 1234,5678 -StorageDir "C:\Analysis"
Combined options:
.\headless_batch_extractor.ps1 `
-ExtractDirRecursive "C:\Binaries" `
-StorageDir "C:\Analysis" `
-IdaPath "C:\IDA92\idat64.exe" `
-MaxConcurrentProcesses 8 `
-NoGenerateCpp -NoDownloadSymbols
The following invocation extracts and decompiles the Windows usermode codebase into SQLite databases and C++ source files. It covers System32 (core libraries and kernel drivers), SystemApps and ImmersiveControlPanel (UWP/packaged apps), Program Files and Program Files (x86) (installed applications and shared frameworks), and IME (input method components) recursively, plus top-level PE files under C:\Windows (e.g., explorer.exe, regedit.exe).
.\headless_batch_extractor.ps1 `
-ExtractDirRecursive 'C:\Windows\System32','C:\Windows\SystemApps','C:\Program Files','C:\Program Files (x86)','C:\Windows\IME','C:\Windows\ImmersiveControlPanel' `
-ExtractDir 'C:\Windows' `
-StorageDir "F:\Analysis\win11_full" `
-MaxConcurrentProcesses 8
Built-in help:
.\headless_batch_extractor.ps1 -Help
Get-Help .\headless_batch_extractor.ps1 -Detailed
Get-Help .\headless_batch_extractor.ps1 -Full
Get-Help .\headless_batch_extractor.ps1 -Examples
For single-file analysis or custom scripting, invoke the plugin directly via IDA's command-line tool (idat.exe or idat64.exe).
"C:\Program Files\IDA Professional 9.2\idat.exe" -A -L"C:\temp\pe_extraction_tests\output.log" -S"main.py --sqlite-db C:\temp\pe_extraction_tests\bitlockercsp.db" "C:\windows\system32\bitlockercsp.dll"
IDA command-line arguments:
- -A: Autonomous mode (no GUI)
- -L: Log file path
- -S: Plugin script to execute (main.py)
- --sqlite-db: Absolute path to the output SQLite database (required)
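When driving single-file extraction from a script, the same command line can be assembled as an argument list. The sketch below mirrors the invocation shown above. It assumes the database and log paths contain no spaces (the `-S` value is a single space-separated string) and that the working directory is the one containing `main.py`; the helper names are illustrative.

```python
import subprocess


def build_idat_command(idat_path, binary_path, db_path, log_path, extra_flags=()):
    """Assemble the idat argument list for one headless extraction."""
    script_args = " ".join(["main.py", "--sqlite-db", db_path, *extra_flags])
    return [idat_path, "-A", f"-L{log_path}", f"-S{script_args}", binary_path]


def run_extraction(command, cwd=None):
    """Run idat and return its exit code (requires IDA Pro on the machine)."""
    return subprocess.run(command, cwd=cwd, check=False).returncode


command = build_idat_command(
    r"C:\Program Files\IDA Professional 9.2\idat.exe",
    r"C:\Windows\System32\kernel32.dll",
    r"C:\output\kernel32.db",
    r"C:\output\log.txt",
    ["--generate-cpp"],
)
```

Passing a list to `subprocess.run` lets Python quote the `-S` argument, so the embedded spaces reach IDA as one argument.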
Optional analysis flags:
--no-extract-dangerous-apis # Skip dangerous API detection
--no-extract-strings # Skip string literal extraction
--no-extract-stack-frame # Skip stack frame analysis
--no-extract-globals # Skip global variable tracking
--no-analyze-loops # Skip loop analysis
--no-pe-info # Skip PE version info
--no-pe-metadata # Skip PE metadata
--no-advanced-pe # Skip Rich header/TLS callbacks
--no-runtime-info # Skip .NET/delay-load analysis
--force-reanalyze # Force re-analysis even if already complete
--generate-cpp # Generate C++ output files for AI review
--cpp-output-dir <path> # Custom directory for C++ output (defaults to extracted_raw_code/ next to db)
--generate-asm # Generate assembly (.asm) files with AI metadata headers
--asm-output-dir <path> # Custom directory for ASM output (defaults to same dir as C++ output)
--thunk-depth N # Maximum thunk resolution depth (default: 5)
--min-call-conf N # Minimum confidence for call validation (10-100)
DeepExtract produces structured data for several research workflows. Detailed documentation for each is in progress.
The headless extractor generates C++ representations of decompiled binaries and optionally assembly files with AI metadata headers into the extracted_code/ directory, organized by module. LLMs (Claude Code, Cursor, Codex) consume the .cpp and/or .asm files alongside file_info.md to evaluate function logic, detect call patterns, and identify security-relevant invariants. Assembly output is particularly useful when the Hex-Rays decompiler is unavailable or not licensed for the target architecture. For structured analysis workflows on top of this output, see the AI Analysis Runtime section.
The interactive plugin exports the current IDA database state, including renamed variables, custom comments, and manual type definitions from the .idb/.i64, into a SQLite database. Researchers can query the functions, file_info, and function_xrefs tables directly via SQL to analyze cross-references, data types, and metadata.
Automated agents use the structured inbound_xrefs, outbound_xrefs, and the function_xrefs table to perform call graph traversal, evaluate reachability, resolve component dependencies, and generate technical summaries of subroutines based on their position in the global call graph.
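Call graph traversal over function_xrefs can be expressed as a recursive SQL query. The sketch below computes which functions are reachable from a starting function and at what call depth, using a small in-memory table with made-up rows. It assumes that an `outbound` row means "source_id calls target_id" and that unresolved targets have a NULL target_id; the depth limit keeps the recursion finite when the graph contains cycles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE function_xrefs ("
    "source_id INTEGER, target_id INTEGER, target_name TEXT, direction TEXT)"
)
# Made-up edges: 1 -> 7 -> 12 -> 7 (cycle), 45 -> 7, plus one import without an id.
conn.executemany("INSERT INTO function_xrefs VALUES (?, ?, ?, ?)", [
    (1, 7, "sub_401000", "outbound"),
    (45, 7, "sub_401000", "outbound"),
    (7, 12, "sub_4015A0", "outbound"),
    (12, 7, "sub_401000", "outbound"),
    (7, None, "WSAStartup", "outbound"),
])


def reachable_from(conn, start_id, max_depth=10):
    """Return (function id, minimum call depth) pairs reachable from start_id."""
    return conn.execute(
        """
        WITH RECURSIVE reach(function_id, depth) AS (
            SELECT ?, 0
            UNION
            SELECT x.target_id, reach.depth + 1
            FROM function_xrefs AS x
            JOIN reach ON x.source_id = reach.function_id
            WHERE x.direction = 'outbound'
              AND x.target_id IS NOT NULL
              AND reach.depth < ?
        )
        SELECT function_id, MIN(depth) AS depth
        FROM reach
        GROUP BY function_id
        ORDER BY depth, function_id
        """,
        (start_id, max_depth),
    ).fetchall()


from_winmain = reachable_from(conn, 1)
```

Swapping the roles of `source_id` and `target_id` in the recursive step gives the reverse question: which functions can reach a given target.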
The extraction output is designed to be consumed by DeepExtractRuntime, a companion analysis runtime that operates on top of the SQLite databases and C++ files. The runtime deploys as an .agent/ directory alongside the extraction data and operates across Claude Code, Cursor, Codex, and any AI coding environment that supports AGENTS.md or equivalent agent configuration.
The runtime provides:
- Slash commands for interactive analysis: /triage, /audit, /explain, /lift-class, /trace-export, /data-flow, /taint, /hunt, /state-machines, /full-report
- Specialized agents (code-lifter, re-analyst, triage-coordinator, type-reconstructor, verifier) that execute multi-step analysis pipelines
- Analysis skills covering function classification, call graph tracing, taint analysis, COM/WRL interface reconstruction, attack surface mapping, type reconstruction, and decompiler verification
- Shared helper modules providing database access, function resolution, API taxonomy (17 functional + 11 security categories), assembly metrics, struct scanning, caching, and cross-module graph analysis
- Lifecycle hooks that inject module context at session start and support batch processing
The headless batch extractor writes two bootstrap files (AGENTS.md and CLAUDE.md) into the output directory. These files contain the full installation procedure and are recognized automatically by AI coding agents. To install the runtime:
- Open the extraction output directory (the StorageDir you passed to headless_batch_extractor.ps1) as a project in Claude Code or Cursor.
- Type install DeepExtractRuntime in the agent chat.
The agent reads the bootstrap instructions from AGENTS.md / CLAUDE.md and executes the full setup automatically: cloning the DeepExtractRuntime repository into .agent/, creating the .claude symlink for Claude Code, copying .cursor/hooks.json and .cursor/rules/*.mdc rule files for Cursor, and verifying the installation. No manual steps are required beyond the initial command.
Once installed, the runtime's slash commands (/triage, /audit, /explain, etc.) and specialized agents become available in the agent session. To update to the latest runtime version, type update DeepExtractRuntime.
See the DeepExtractRuntime README and Onboarding Guide for full documentation.
- Operating System: Windows 10/11
- IDA Pro: Version 9.x (Pro edition required for headless mode)
- Decompiler: Hex-Rays Decompiler (optional; required for decompiled code output and C++ generation)
- Python: Python 3 environment configured within IDA (built-in with IDA 9.x)
- Windows SDK Debugging Tools (optional): Required for automatic symbol downloading (symchk.exe). Included with WinDbg and the Windows SDK; select "Debugging Tools for Windows" during SDK installation.
- Dependencies:
  - pefile (bundled in deps/; used for PE header parsing)
  - IDA Python SDK (built-in with IDA Pro)
Technical references for the extraction formats and database schemas:
- Data Format Reference: SQLite schema, data architecture, and analysis heuristics
- Analysis Metadata and Reports Reference: file_info.md, file_info.json, and C++ code output structure
- Function Index Format Reference: function_index.json format and library tagging
- Module Profile Format Reference: module_profile.json computation covering identity, scale, library composition, API surface, complexity, and security posture
DeepExtract conforms to the IDA 9.x plugmod_t plugin architecture.
Entry Point: main.py (IDA plugin entry point via PLUGIN_ENTRY())
Core Modules:
- deep_extract/pe_context_extractor.py - Main analysis pipeline and orchestration
- deep_extract/extractor_core.py - Public API hub; re-exports from all analysis modules
- deep_extract/config.py - Configuration dataclass and validation
- deep_extract/constants.py - Analysis limits, function type classification, dangerous API matching, decompilation failure sentinel detection
- deep_extract/schema.py - SQLite schema management and migration
- deep_extract/db_connection.py - SQLite connection management and PRAGMA configuration
- deep_extract/logging_utils.py - Logging, memoization caching, and utility functions
- deep_extract/json_safety.py - JSON serialization with truncation and size limits
- deep_extract/__init__.py - Package init; re-exports public API from extractor_core and pe_context_extractor
Analysis Modules:
- deep_extract/xref_analysis.py - Cross-reference analysis and call graph building
- deep_extract/vtable_analysis.py - C++ vtable call resolution and method grouping (experimental)
- deep_extract/loop_analysis.py - Control flow and loop detection (dominator-based, SCC fallback)
- deep_extract/indirect_call_analysis.py - Indirect call resolution and jump table detection
- deep_extract/interprocedural_analysis.py - Cross-function data flow analysis
- deep_extract/thunk_analysis.py - Thunk chain resolution with configurable depth
- deep_extract/string_analysis.py - String literal extraction per function
- deep_extract/stack_analysis.py - Aggregate stack frame metrics and canary detection
- deep_extract/name_extraction.py - Function name extraction and demangling
- deep_extract/import_resolution.py - Import module resolution (IAT address to module name)
- deep_extract/validation.py - Call validation with confidence scoring
- deep_extract/pe_metadata.py - PE header, Rich header, TLS callback extraction
Output Generation:
- deep_extract/cpp_generator.py - C++ code generation for AI consumption
- deep_extract/asm_generator.py - Assembly file generation with AI metadata headers
- deep_extract/module_profile.py - Module fingerprint generation (module_profile.json)
- deep_extract/gui_dialog.py - Interactive mode configuration dialog
Utilities:
- deep_extract/utils/check_analyzed_files.py - Batch analysis file-selection helper (hash, flags, stale lock checks)
Plugin Lifecycle:
- IDA loads main.py and invokes PLUGIN_ENTRY()
- Plugin factory (DeepExtractPlugin) initializes and creates a module instance
- Plugin module (DeepExtractModule) executes per-database
- Mode detection: presence of --sqlite-db in arguments selects headless mode; otherwise the GUI dialog is displayed
- Headless mode: runs the full pipeline, then exits via ida_pro.qexit()
- Interactive mode: displays the configuration dialog, then runs the pipeline with user-selected options
- Script mode (-S): main.py detects script execution, creates DeepExtractModule directly, and runs the pipeline without PLUGIN_ENTRY()
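The mode-detection rule in the lifecycle above can be illustrated outside IDA with plain argument handling. This sketch is not the plugin's code: how `main.py` obtains its argument list inside IDA is not covered here, the function names are hypothetical, and only a few of the documented flags are modeled.

```python
import argparse


def detect_mode(script_args):
    """Headless if --sqlite-db is present in the arguments, else interactive."""
    for arg in script_args:
        if arg == "--sqlite-db" or arg.startswith("--sqlite-db="):
            return "headless"
    return "interactive"


def parse_headless_args(script_args):
    """Parse a subset of the documented flags; unknown flags are ignored."""
    parser = argparse.ArgumentParser(prog="main.py", add_help=False)
    parser.add_argument("--sqlite-db", required=True)
    parser.add_argument("--generate-cpp", action="store_true")
    parser.add_argument("--generate-asm", action="store_true")
    parser.add_argument("--force-reanalyze", action="store_true")
    parser.add_argument("--thunk-depth", type=int, default=5)
    args, _unknown = parser.parse_known_args(script_args)
    return args


example_args = ["--sqlite-db", r"C:\output\kernel32.db", "--generate-cpp"]
mode = detect_mode(example_args)
options = parse_headless_args(example_args)
```

With these arguments the mode is headless, C++ generation is on, and the thunk resolution depth falls back to the documented default of 5.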
DeepExtract is packaged as an IDA 9.x plugin following the HCLI plugin format.
Package contents:
- ida-plugin.json - Plugin metadata and dependency specification
- main.py - Plugin entry point
- deep_extract/ - Core analysis framework
- deps/ - Bundled dependencies (pefile)
DeepExtract - Developed by Marcos Oviedo for Agentic Vulnerability Research