Skip to content

repurtum/research

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

research notes for repurtum

core idea:

the core idea that this research lab is working on is to do interpretability research using type theory.

core basis for how are going about it :

phase1:

  • going through these research papers to understand the core way that this field is moving forward:
    • Validating Mech Interpretations: An Axiomatic Approach
      • Core Idea:  Inspired by the notion of abstract interpretation from the program analysis literature that aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner
  • Moving on how typed chain of thought is moving forward:
    • Typed Chain-of-Thought : Using Curry Howard Framework for Verifying LLM Reasoning
      • Core Idea: Represent methods to extract and map the informal , natural language steps of CoT into a formal, typed proof structure.
      • If we convert a CoT trace into a well-typed proof , it will serve as a strong, verifiable certificate of its computational faithfulness , moving away from heuristics based interpretability towards a more formal verification based approach.
  • Then after we have done the mechanisitic interpretability to compositional interepretability
    • From Mechanistic to Compositional Interpretability
      • Core Idea: Introduces the notion of compositional interpretability, using category theory framework grounded in the principles of compositionally and minimum description length.
      • These are pairs of syntactic and semantic mappings that must commute to enforce consistency between a model's decomposition and its observed behaviour.
      • Then deconstruct explanation quality into measuring the faithfulness and complexity to cast interpretability as a constrained optimization problem, and then introduce compressive refinement to systematically restructure the models into simpler parts without altering their function.
      • At the end prove tha a parsimony criterion under which synactic compression theoretically gurantees more concise , human-aligned explanations.
  • Move into improving model performance via compact proofs
    • Compact Proofs of Model Performance via Mechanistic Interpretability
      • Core Idea: Formally prove the accuracy lower bounds for a small transformer trained on Max-of-k, validating proof transferability across 151 randoom seeds and 4 values of K.
      • Using quantitative metrics , find that shorter proofs seem to require and provide more mechanistic understanding.
      • Find that more faithful mechanisirtic undertstanding leads to tighter performance bounds.
      • Confirm these connections by qualitatively examining a subset of our proofs.
      • Identify compounding structureless errors as a key challenge for using mech interp to generate compact proofs on model performance.
  • Mental Model corresponding from Type Theory to Mech Interp:
    • Program -> Trained Model
    • Type -> Behaviour / property
    • Proof -> Circuit / Explanation
    • Type Checking -> Faithfulness validation
    • Abstract Interpretation -> Axiomatic mech Interp
    • Curry Howard -> Typed CoT
    • Categorical Semantics -> Compositional Interpretability
    • Normal Form / Compression -> MDL-Based explanation quality
    • Compositionality -> Circuit Composition / layer-wise refinement

About

research and draft notes

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors