PYC Disassembler Techniques: Tips for Accurate Bytecode Analysis
Understanding Python bytecode and learning how to disassemble .pyc files is a valuable skill for reverse engineers, security researchers, forensic analysts, and developers who need to recover lost source code or inspect third‑party libraries. This article covers practical techniques, common pitfalls, and tips for achieving accurate bytecode analysis of PYC files across different Python versions.
What is a PYC file?
A .pyc file holds compiled Python bytecode — the intermediate representation produced by the Python compiler when a .py file is imported or compiled. Bytecode runs on the Python Virtual Machine (PVM) and is platform-independent, but its format and opcode set change between Python versions, which affects disassembly and analysis.
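
A known-good sample makes it easier to check analysis tooling before pointing it at an unknown file. The sketch below compiles a small source string to a .pyc with the standard py_compile module; the helper name make_sample_pyc and the file names are illustrative assumptions.

```python
import importlib.util
import pathlib
import py_compile
import tempfile


def make_sample_pyc(source: str, workdir: str) -> pathlib.Path:
    """Write `source` to sample.py in workdir and compile it to sample.pyc."""
    src = pathlib.Path(workdir) / "sample.py"
    src.write_text(source)
    out = pathlib.Path(workdir) / "sample.pyc"
    py_compile.compile(str(src), cfile=str(out), doraise=True)
    return out


with tempfile.TemporaryDirectory() as tmp:
    pyc = make_sample_pyc("def greet(name):\n    return 'hi ' + name\n", tmp)
    data = pyc.read_bytes()
    # The file begins with the magic number of the interpreter that wrote it.
    print("magic matches running Python:", data[:4] == importlib.util.MAGIC_NUMBER)
```

The resulting file is only valid for the interpreter version that produced it, which is why version identification comes first in any analysis.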
Prepare your environment
- Use a controlled, offline environment for analysis to avoid running untrusted code.
- Install multiple Python versions you expect to encounter (e.g., 3.6, 3.7, 3.8, 3.9, 3.10, 3.11). Tools like pyenv simplify switching versions.
- Keep copies of original .pyc files and never overwrite them.
- Gather tools: Python’s built-in dis module, uncompyle6, decompyle3, pycdc, pyinstxtractor (for extracting from installers), and binary tools like hexdump/xxd.
Identify Python version and PYC format
Before disassembling, determine which Python version produced the .pyc:
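- Inspect the header. Files produced by Python 3.7 and later have a 16-byte header: a 4-byte magic number, a 4-byte flags field, and then either a source timestamp and source size or an 8-byte source hash. Python 3.3 to 3.6 use a 12-byte header (magic, timestamp, size), and older versions use 8 bytes (magic, timestamp). The magic number maps to a Python version.
- Quick method: read the first 2 bytes as an unsigned little-endian integer and compare the value with the known magic numbers for each Python version. Bytes 3 and 4 should be \r\n (0x0D 0x0A).
- If the header is stripped or altered, infer the version from opcode patterns or try disassembling with different Python versions.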
Knowing the correct Python version ensures the right opcode table is used and reduces decompilation errors.
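
As a minimal sketch of the header check, the function below decodes the 16-byte layout used by Python 3.7 and later and reports whether the magic number matches the interpreter running the script. The name parse_pyc_header and the dictionary keys are assumptions made for illustration; a lookup table from magic number to version is deliberately left out, and older header layouts are not handled.

```python
import importlib.util
import struct


def parse_pyc_header(data: bytes) -> dict:
    """Decode the 16-byte header used by Python 3.7+ .pyc files."""
    if len(data) < 16:
        raise ValueError("too short for a 3.7+ .pyc header")
    # Magic: 2-byte little-endian number followed by b"\r\n".
    magic_word, marker = struct.unpack("<H2s", data[:4])
    (flags,) = struct.unpack("<I", data[4:8])
    info = {
        "magic_word": magic_word,
        "marker_ok": marker == b"\r\n",
        "flags": flags,
        "matches_running_python": data[:4] == importlib.util.MAGIC_NUMBER,
    }
    if flags & 0x1:
        # Hash-based .pyc: the last 8 bytes are a hash of the source.
        info["source_hash"] = data[8:16]
    else:
        # Timestamp-based .pyc: source mtime, then source size.
        mtime, size = struct.unpack("<II", data[8:16])
        info["source_mtime"] = mtime
        info["source_size"] = size
    return info


# Demo on a header built in memory rather than a real file.
demo = importlib.util.MAGIC_NUMBER + struct.pack("<III", 0, 1700000000, 42)
print(parse_pyc_header(demo))
```

If marker_ok is false, the header has probably been stripped or tampered with, and the remaining fields should not be trusted.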
Use the correct disassembler/decompiler
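- For raw bytecode inspection and opcode-level analysis, Python’s built-in dis module is reliable: it exposes opcodes, argument values, and line numbers, and dis.stack_effect() reports the stack effect of an opcode.
  - Example: open the .pyc in binary mode, skip the header, call marshal.load() on the rest of the file, and pass the resulting code object to dis.dis(). The marshal format is version-specific, so do this with the Python version that produced the file.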
- For recovering readable source, use decompilers:
  - uncompyle6 supports many Python 2.x/3.x versions, up to 3.8.
  - decompyle3 targets Python 3.7 and 3.8.
  - pycdc and others may produce different output; try multiple tools and compare results.
- For pyc files packed inside installers (PyInstaller, cx_Freeze), extract embedded archives first (pyinstxtractor, binwalk).
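
For scripted opcode-level analysis, dis.get_instructions() returns structured records instead of printed text. The sketch below works on a code object compiled in memory so it runs anywhere; the helper name list_instructions is an assumption, and the exact opcodes listed depend on the interpreter version.

```python
import dis


def list_instructions(code):
    """Return (offset, opname, argrepr) tuples for one code object."""
    return [(ins.offset, ins.opname, ins.argrepr)
            for ins in dis.get_instructions(code)]


sample = compile("total = price * qty\n", "<sample>", "exec")
for offset, opname, argrepr in list_instructions(sample):
    print(f"{offset:4d} {opname:<20} {argrepr}")
```

The same function accepts a code object loaded from a .pyc with marshal, as long as the interpreter version matches the file.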
Handle code objects and nested structures
A .pyc contains a marshaled top-level code object that may include constants which are themselves code objects (nested functions, lambdas, comprehensions, class bodies). Recursively traverse code.co_consts and disassemble each code object to get a full view of behavior.
Example approach:
- Load code object with marshal.
- Write a recursive function to disassemble and annotate each nested code object with its name and starting line number.
- Track relationships: which code objects are used as defaults, closures, or class bodies.
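
The traversal described above can be sketched as a generator that yields every code object with a dotted name built from its parents. The function name walk_code and the naming scheme are illustrative assumptions, not a standard API.

```python
import types


def walk_code(code, parent="", depth=0):
    """Yield (qualified_name, first_line, depth, code) for code and its children."""
    qualified = f"{parent}.{code.co_name}" if parent else code.co_name
    yield qualified, code.co_firstlineno, depth, code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from walk_code(const, qualified, depth + 1)


source = (
    "def outer():\n"
    "    def inner():\n"
    "        return 1\n"
    "    return inner\n"
)
top = compile(source, "<sample>", "exec")
for name, line, depth, _ in walk_code(top):
    print("  " * depth + f"{name} (line {line})")
```

Yielding the code object itself lets later steps (disassembly, constant scanning, line mapping) reuse one traversal.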
Understand common obfuscation and packing techniques
Malicious or obfuscated .pyc files may use techniques such as:
- Encrypted payloads or XORed bytes — detect by nonstandard headers or invalid marshaled data.
- Custom import hooks that decrypt bytecode at import time.
- Dynamic code generation (exec/compile/ast) where source isn’t present in .pyc.
- Code object mutation: altering co_consts, co_names, bytecode arrays, or line number tables.
To analyze these:
- Look for unusual constants (large byte strings), calls to builtins like exec/compile, or imported modules like ctypes, marshal, or importlib.
- Emulate or instrument execution in a sandboxed interpreter to let the code reveal decrypted bytecode; capture resulting code objects with sys.settrace or by patching builtins.
- Use hexdump and entropy analysis to spot encrypted sections.
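
A static first pass over these indicators might look like the sketch below, which never executes the code under analysis. It computes Shannon entropy for large bytes constants and collects referenced names that match a watch list. The watch list, the 256-byte threshold, and the function names are assumptions chosen for illustration.

```python
import math
import types
from collections import Counter

SUSPICIOUS_NAMES = {"exec", "eval", "compile", "marshal", "ctypes", "importlib"}


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def triage(code, min_blob=256):
    """Return (suspicious names, [(size, entropy)] of large bytes constants)."""
    names = set(code.co_names) & SUSPICIOUS_NAMES
    blobs = []
    for const in code.co_consts:
        if isinstance(const, bytes) and len(const) >= min_blob:
            blobs.append((len(const), round(shannon_entropy(const), 2)))
        elif isinstance(const, types.CodeType):
            # Recurse into nested functions, lambdas, and class bodies.
            child_names, child_blobs = triage(const, min_blob)
            names |= child_names
            blobs += child_blobs
    return names, blobs


src = "import marshal\npayload = " + repr(bytes(range(256))) + "\nexec(marshal.loads(payload))\n"
print(triage(compile(src, "<sample>", "exec")))
```

Entropy close to 8 bits per byte suggests encrypted or compressed data, but it is only a hint: short blobs and legitimately compressed resources score high as well.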
Reconstructing control flow and higher-level constructs
Bytecode disassembly shows low-level instructions; mapping them back to high-level constructs improves readability:
- Identify basic blocks by locating jump targets and exception handler ranges. Python 3.11+ stores handlers in co_exceptiontable; older versions mark them with block-setup opcodes such as SETUP_FINALLY (and SETUP_EXCEPT before 3.8).
- Reconstruct loops: backward jumps usually indicate loops. Look for SETUP_LOOP and POP_BLOCK before Python 3.8, a JUMP_ABSOLUTE to a lower offset in 3.8 to 3.10, and JUMP_BACKWARD in 3.11+.
- Recreate conditional structure: compare jump-if-true/false instructions and subsequent fall-through paths.
- Map LOAD_GLOBAL/LOAD_FAST/STORE_FAST to variable usage to infer each variable's scope and role.
Graphing tools (Graphviz) can help visualize control-flow graphs (CFG) of bytecode.
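
The first step of building a CFG is finding block leaders, the offsets where a basic block begins. The sketch below is a simplified version under stated assumptions: it treats jump targets, the first instruction, and any instruction following a jump, return, or raise as leaders, and the set of terminator opcode names is an assumption that covers common cases rather than every version's opcode set.

```python
import dis

# Opcode names that end a block without being jumps (assumed, not exhaustive).
TERMINATORS = {"RETURN_VALUE", "RETURN_CONST", "RAISE_VARARGS", "RERAISE"}


def block_leaders(code):
    """Return sorted offsets where a basic block starts in one code object."""
    jump_opcodes = set(dis.hasjrel) | set(dis.hasjabs)
    leaders = set()
    starts_block = True  # the first instruction always starts a block
    for ins in dis.get_instructions(code):
        if starts_block or ins.is_jump_target:
            leaders.add(ins.offset)
        # Whatever follows a jump or terminator begins a new block.
        starts_block = ins.opcode in jump_opcodes or ins.opname in TERMINATORS
    return sorted(leaders)


def sample(x):
    if x:
        return 1
    return 2


print(block_leaders(sample.__code__))
```

Edges can then be added by connecting each block to its jump target and, for conditional jumps, to the fall-through block; the result can be written out in DOT format for Graphviz.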
Recovering variable names, constants, and literals
- co_varnames, co_names, co_consts, co_cellvars, and co_freevars hold names and literals. Use them to annotate disassembly.
- For obfuscated names (short or meaningless), correlate usage patterns (attribute access, function calls) to infer purpose.
- For missing or mangled names, type inference based on opcode sequences (e.g., methods called on an object) can suggest likely types.
Line numbers and source mapping
- co_firstlineno and the line number table (co_lnotab in older Pythons, co_linetable in 3.10+) map bytecode offsets to source lines. Read them through dis.findlinestarts() or, in 3.10+, the co_lines() method rather than decoding the raw bytes, and use them to approximate the original source layout.
- When line number data is missing or coarse-grained, reconstruct likely indentation and block boundaries from jump targets and, in versions before 3.11, from SETUP_* and POP_BLOCK operations.
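
Because the table encoding differs between versions, a version-neutral way to read it is dis.findlinestarts(). The following sketch groups bytecode offsets by source line; the helper name offsets_by_line is an assumption.

```python
import dis


def offsets_by_line(code):
    """Map each source line number to the bytecode offsets where it starts."""
    table = {}
    for offset, line in dis.findlinestarts(code):
        if line is not None:  # newer versions may report offsets with no line
            table.setdefault(line, []).append(offset)
    return table


sample = compile("a = 1\nb = 2\nc = a + b\n", "<sample>", "exec")
for line, offsets in sorted(offsets_by_line(sample).items()):
    print(f"line {line}: starts at offsets {offsets}")
```

A line that maps to several offsets usually belongs to a loop or a try block, where the compiler emits code for the same source line in more than one place.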
Practical tips for accurate decompilation
- Always try multiple decompilers and cross-check outputs; combine the best parts manually.
- Use the same Python major/minor version that produced the .pyc when running dis or decompilers.
- When output is syntactically incorrect, inspect troublesome functions at bytecode level and fix by hand—small changes often restore structure.
- Preserve original timestamps and headers when recompiling to test fixes.
- Document each transformation: keep both raw disassembly and reconstructed source for auditing.
Automation and scripting
- Automate repetitive analysis: scripts to extract headers, detect Python version, recursively disassemble code objects, and run multiple decompilers.
- Example pipeline:
  - Identify format/version from header.
  - Extract top-level code object (marshal).
  - Recursively dump code object metadata.
  - Run decompilers and collect outputs.
  - Diff outputs to highlight disagreements.
- Use small unit tests where possible: recompile recovered source and compare bytecode or behavior against original in a safe sandbox.
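
The recompile-and-compare check can be sketched as a recursive comparison of code objects. This version compares raw bytecode, name tables, and constants while ignoring file names and line number tables, which normally differ between the original and a reconstruction; the function name same_bytecode and the choice of fields are assumptions. Both code objects must come from the same Python version.

```python
import types


def same_bytecode(a, b):
    """Return True if two code objects match in bytecode, names, and constants."""
    if a.co_code != b.co_code or a.co_names != b.co_names:
        return False
    if a.co_varnames != b.co_varnames or len(a.co_consts) != len(b.co_consts):
        return False
    for x, y in zip(a.co_consts, b.co_consts):
        if isinstance(x, types.CodeType) and isinstance(y, types.CodeType):
            # Nested code objects are compared field by field, not with ==.
            if not same_bytecode(x, y):
                return False
        elif type(x) is not type(y) or x != y:
            return False
    return True


original = compile("def f(a):\n    return a + 1\n", "original.py", "exec")
recovered = compile("def f(a):\n    return a + 1\n", "recovered.py", "exec")
print(same_bytecode(original, recovered))
```

A mismatch does not always mean the recovered source is wrong, since equivalent source can compile to different bytecode, but a match is strong evidence that the reconstruction is faithful.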
Common pitfalls and how to avoid them
- Mismatched Python version: produces wrong opcode mapping — always confirm magic number.
- Assuming decompiler output is correct: decompilers can produce valid but semantically different code.
- Running untrusted bytecode directly: always sandbox or use emulation.
- Over-reliance on names: obfuscation often hides intent; rely on behavior and usage instead.
Legal and ethical considerations
Analyzing .pyc files from third-party binaries can raise legal or ethical issues. Ensure you have permission to reverse-engineer or analyze the code. For security research, follow responsible disclosure practices.
Example: minimal Python script to disassemble a .pyc
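```python
# dis_pyc.py: Python 3.8+ example
# Run it with the same Python version that produced the .pyc, because the
# marshal format and the opcode table are version-specific.
import dis
import marshal
import sys

HEADER_SIZE = 16  # Python 3.7+; use 12 for 3.3 to 3.6 and 8 for older files


def load_codeobj(pyc_path):
    """Skip the header and unmarshal the top-level code object."""
    with open(pyc_path, "rb") as f:
        f.seek(HEADER_SIZE)
        return marshal.load(f)


def recurse_dis(codeobj, indent=0):
    """Disassemble one code object, then each code object nested in it."""
    print(" " * indent + f"Disassembling {codeobj.co_name} "
          f"(firstlineno={codeobj.co_firstlineno})")
    # depth=0 stops dis from recursing itself, so nothing is printed twice.
    dis.dis(codeobj, depth=0)
    for const in codeobj.co_consts:
        if isinstance(const, type(codeobj)):
            recurse_dis(const, indent + 2)


if __name__ == "__main__":
    recurse_dis(load_codeobj(sys.argv[1]))
```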
Summary
Accurate PYC disassembly combines the right tooling, correct Python-version identification, careful traversal of nested code objects, and understanding of obfuscation techniques. Use multiple decompilers, sandboxed execution, and manual bytecode inspection to build a faithful reconstruction of original source logic while observing legal constraints.