core idea:
the core idea that this research lab is working on is to do interpretability research using type theory.
core basis for how are going about it :
phase1:
- going through these research papers to understand the core way that this field is moving forward:
- Validating Mech Interpretations: An Axiomatic Approach
- Core Idea: Inspired by the notion of abstract interpretation from the program analysis literature that aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner
- Validating Mech Interpretations: An Axiomatic Approach
- Moving on how typed chain of thought is moving forward:
- Typed Chain-of-Thought : Using Curry Howard Framework for Verifying LLM Reasoning
- Core Idea: Represent methods to extract and map the informal , natural language steps of CoT into a formal, typed proof structure.
- If we convert a CoT trace into a well-typed proof , it will serve as a strong, verifiable certificate of its computational faithfulness , moving away from heuristics based interpretability towards a more formal verification based approach.
- Typed Chain-of-Thought : Using Curry Howard Framework for Verifying LLM Reasoning
- Then after we have done the mechanisitic interpretability to compositional interepretability
- From Mechanistic to Compositional Interpretability
- Core Idea: Introduces the notion of compositional interpretability, using category theory framework grounded in the principles of compositionally and minimum description length.
- These are pairs of syntactic and semantic mappings that must commute to enforce consistency between a model's decomposition and its observed behaviour.
- Then deconstruct explanation quality into measuring the faithfulness and complexity to cast interpretability as a constrained optimization problem, and then introduce compressive refinement to systematically restructure the models into simpler parts without altering their function.
- At the end prove tha a parsimony criterion under which synactic compression theoretically gurantees more concise , human-aligned explanations.
- From Mechanistic to Compositional Interpretability
- Move into improving model performance via compact proofs
- Compact Proofs of Model Performance via Mechanistic Interpretability
- Core Idea: Formally prove the accuracy lower bounds for a small transformer trained on Max-of-k, validating proof transferability across 151 randoom seeds and 4 values of K.
- Using quantitative metrics , find that shorter proofs seem to require and provide more mechanistic understanding.
- Find that more faithful mechanisirtic undertstanding leads to tighter performance bounds.
- Confirm these connections by qualitatively examining a subset of our proofs.
- Identify compounding structureless errors as a key challenge for using mech interp to generate compact proofs on model performance.
- Compact Proofs of Model Performance via Mechanistic Interpretability
- Mental Model corresponding from Type Theory to Mech Interp:
- Program -> Trained Model
- Type -> Behaviour / property
- Proof -> Circuit / Explanation
- Type Checking -> Faithfulness validation
- Abstract Interpretation -> Axiomatic mech Interp
- Curry Howard -> Typed CoT
- Categorical Semantics -> Compositional Interpretability
- Normal Form / Compression -> MDL-Based explanation quality
- Compositionality -> Circuit Composition / layer-wise refinement