hreyulog/github_opinion_dynamics

Overview

(Generated by DeepWiki.) The dataset is available on Hugging Face: https://huggingface.co/datasets/hreyulog/GitHub-3Repo-7User-Opinion-Dynamics

Purpose and Scope

The GitHub Opinion Dynamics system is a research pipeline designed to analyze how developer "opinions" evolve over time within software development communities. The system treats code changes as expressions of opinion, using machine learning embeddings to quantify semantic differences between old and new code versions. By aggregating and analyzing these embeddings across authors and time periods, the system produces time-series visualizations showing how different contributors' coding approaches converge or diverge.

This document provides a high-level introduction to the system architecture, data flow, and key components; the sections below describe the individual subsystems in more detail.

System Architecture

The system follows a multi-stage pipeline architecture, processing GitHub repository data through several transformation layers:

High-Level Architecture Diagram

```mermaid
flowchart TD
    subgraph Data_Collection_Layer ["Data Collection Layer"]
        MAX["get_max_repo.py<br>Repository Selection"]
        ONE["get_one_repository.py<br>Single Repo Filtering"]
        USERS["get_7_users_pr.py<br>Key User Selection"]
    end

    subgraph Code_Extraction_Layer ["Code Extraction Layer"]
        SNIPPET["get_old_new_snippet.py<br>Patch Parser"]
        OLD["old_code extraction"]
        NEW["new_code extraction"]
    end

    subgraph Embedding_Generation_Layer ["Embedding Generation Layer"]
        EMB["get_embedding.py<br>get_pytorch_serve_embeddings()"]
        SERVE["PyTorch Serving<br>localhost:8080/predictions/emb_comp"]
        DIFF["difference_emb = new_code_emb - old_code_emb"]
    end

    subgraph Aggregation_Layer ["Aggregation Layer"]
        PR["get_average_pr.py<br>PR-Level Aggregation"]
        TIME["get_average_time.py<br>compute_mean_emb()"]
    end

    subgraph Dimensionality_Reduction ["Dimensionality Reduction"]
        PCA["PCA Pipeline<br>StandardScaler → PCA → QuantileTransformer → MinMaxScaler"]
    end

    subgraph Visualization_Layer ["Visualization Layer"]
        DRAW["opinion_draw.py<br>Time Series Plots"]
        OUT["opinion_{repo}_all3.png"]
    end

    subgraph Network_Analysis_Branch ["Network Analysis Branch"]
        LINK["get_link_users.py<br>Interaction Matrix"]
        NET["network_comment_issue.py<br>Interactive Graphs"]
        UTIME["user_issue_time.py<br>Temporal Activity"]
    end

    MAX --> ONE
    ONE --> USERS
    ONE --> SNIPPET
    SNIPPET --> OLD
    SNIPPET --> NEW
    OLD --> EMB
    NEW --> EMB
    EMB --> SERVE
    EMB --> DIFF
    DIFF --> PR
    PR --> TIME
    TIME --> PCA
    PCA --> DRAW
    DRAW --> OUT
    ONE --> LINK
    ONE --> NET
    ONE --> UTIME
```

Sources: get_embedding.py L1-L44, get_old_new_snippet.py L1-L89, get_average_time.py L1-L124

Core Data Flow

The system processes data through seven distinct stages:

| Stage | Input | Process | Output | Key Script |
|---|---|---|---|---|
| 1. Repository Selection | GitHub PR data | Filter by issue count | filtered_pr_{repo}.csv | get_max_repo.py |
| 2. User Identification | PR metadata | Identify top 7 contributors | User list | get_7_users_pr.py |
| 3. Code Extraction | Git patches | Parse diffs into old/new code | {repo}_diff.csv | get_old_new_snippet.py |
| 4. Embedding Generation | Code snippets | Query PyTorch model | difference_code_emb_{repo}.npy | get_embedding.py |
| 5. PR Aggregation | Difference embeddings | Mean by PR | pr_emb_{repo}.npy | get_average_pr.py |
| 6. Author-Time Aggregation | PR embeddings | Mean by author + time | author_time_emb_{repo}.npy | get_average_time.py |
| 7. Dimensionality Reduction | 768D embeddings | PCA to 1D | author_time_{repo}.csv | get_average_time.py |

Sources: get_embedding.py L6-L44, get_old_new_snippet.py L16-L89, get_average_time.py L28-L118

Key Components

1. PyTorch Serving Interface

The system relies on an external PyTorch model serving endpoint at http://localhost:8080/predictions/emb_comp that converts code strings into 768-dimensional embedding vectors. The get_pytorch_serve_embeddings() function handles communication with this service.

Key Functions:

  • get_pytorch_serve_embeddings() in get_embedding.py L6-L22
  • POST requests to localhost:8080/predictions/emb_comp
  • Returns embeddings as torch.Tensor objects

Sources: get_embedding.py L6-L26
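
Below is a minimal sketch of how a client might call this endpoint. Only the URL and the function name get_pytorch_serve_embeddings() appear in the source; the request payload format, the JSON response shape, and the one-snippet-per-request batching are assumptions made for illustration.

```python
import requests
import torch

SERVE_URL = "http://localhost:8080/predictions/emb_comp"

def get_pytorch_serve_embeddings(code_snippets):
    """Send code strings to the serving endpoint and return an (N, 768) tensor."""
    embeddings = []
    for snippet in code_snippets:
        # Assumed payload format: the raw code string is posted as the request body.
        response = requests.post(SERVE_URL, data=snippet.encode("utf-8"))
        response.raise_for_status()
        # Assumed response format: a JSON list of 768 floats per snippet.
        embeddings.append(torch.tensor(response.json()))
    return torch.stack(embeddings)
```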

2. Patch Parser

The get_old_new_snippet.py module parses git diff patches to extract the before/after code states for each file change. It processes patches line-by-line, identifying added lines (+), removed lines (-), and unchanged context lines.

Key Processing Steps:

  • Filter files by extension (exclude .md, include code files with extensions)
  • Parse diff lines starting with -, +, or context markers
  • Remove @@ hunk headers using regex
  • Normalize whitespace to create single-line code strings

Sources: get_old_new_snippet.py L16-L85
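
The sketch below illustrates this parsing approach under stated assumptions: the function name is hypothetical, file headers and hunk headers are simply skipped (the source removes the @@ markers with a regex), and the edge cases handled in get_old_new_snippet.py may differ.

```python
def split_patch(patch: str) -> tuple[str, str]:
    """Split a unified-diff patch into single-line old and new code strings."""
    old_lines, new_lines = [], []
    for line in patch.splitlines():
        if line.startswith(("@@", "+++", "---")):
            # Hunk headers like "@@ -10,7 +10,8 @@" and file headers carry no code
            continue
        if line.startswith("-"):
            old_lines.append(line[1:])
        elif line.startswith("+"):
            new_lines.append(line[1:])
        else:
            # Context lines belong to both the old and the new version
            old_lines.append(line)
            new_lines.append(line)
    # Normalize whitespace so each version becomes a single-line string
    old_code = " ".join(" ".join(old_lines).split())
    new_code = " ".join(" ".join(new_lines).split())
    return old_code, new_code
```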

3. Embedding Aggregation Pipeline

Embeddings are aggregated in two stages using the compute_mean_emb() function:

flowchart TD

DIFF["difference_code_emb_{repo}.npy(Code-level embeddings)"]
PR_EMB["pr_emb_{repo}.npy(PR-level embeddings)"]
AUTH_EMB["author_time_emb_{repo}.npy(Author-time embeddings)"]
Loading

The compute_mean_emb() function (get_average_time.py L17-L22) performs array stacking and mean computation.
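
A minimal sketch of that stack-and-mean step is shown below, assuming the function receives a list of 768-dimensional NumPy arrays; the actual signature in get_average_time.py may differ.

```python
import numpy as np

def compute_mean_emb(embeddings: list[np.ndarray]) -> np.ndarray:
    """Stack a list of (768,) embedding vectors and return their element-wise mean."""
    stacked = np.stack(embeddings, axis=0)  # shape: (n, 768)
    return stacked.mean(axis=0)             # shape: (768,)
```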

Sources: get_average_time.py L17-L40, get_embedding.py L30-L34

4. Multi-Stage Transformation Pipeline

The dimensionality reduction pipeline in get_average_time.py applies four sequential transformations to convert 768D embeddings to interpretable 1D opinion scores:

```mermaid
flowchart TD
    INPUT["author_time_emb_{repo}.npy<br>(768 dimensions)"]
    STAGE1["StandardScaler<br>Normalize to μ=0, σ=1"]
    STAGE2["PCA(n_components=1)<br>Extract principal component"]
    STAGE3["QuantileTransformer<br>output_distribution='normal'"]
    STAGE4["MinMaxScaler<br>Scale to [0,1]"]
    OUTPUT["author_time_{repo}.csv<br>(1 dimension)"]

    INPUT --> STAGE1
    STAGE1 --> STAGE2
    STAGE2 --> STAGE3
    STAGE3 --> STAGE4
    STAGE4 --> OUTPUT
```

Implementation: get_average_time.py L46-L58

Sources: get_average_time.py L42-L62
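
The following sketch reproduces the four stages using the scikit-learn classes named in the diagram; apart from n_components=1 and output_distribution='normal', all parameters are library defaults assumed here rather than taken from the source.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer, StandardScaler

def reduce_to_opinion_scores(embeddings: np.ndarray) -> np.ndarray:
    """Map (K, 768) author-time embeddings to 1D opinion scores in [0, 1]."""
    x = StandardScaler().fit_transform(embeddings)                           # normalize to mean 0, std 1
    x = PCA(n_components=1).fit_transform(x)                                 # first principal component
    x = QuantileTransformer(output_distribution="normal").fit_transform(x)   # Gaussianize the scores
    x = MinMaxScaler().fit_transform(x)                                      # rescale to [0, 1]
    return x.ravel()
```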

5. Time Series Filtering

The system identifies consecutive time periods in which all authors have data and selects the longest such period for visualization. The algorithm (get_average_time.py L71-L96) groups consecutive months with complete author coverage.

Key Logic:

  • Sort time periods chronologically
  • Group consecutive months where all authors contributed
  • Select longest consecutive group
  • Apply additional 4-month offset and 12-month window

Sources: get_average_time.py L68-L96
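
An illustrative sketch of the longest-complete-run selection follows, assuming a table with author and monthly-period columns; the actual implementation may be structured differently and additionally applies the 4-month offset and 12-month window noted above.

```python
import pandas as pd

def longest_complete_run(df: pd.DataFrame, authors: list[str]) -> list[pd.Period]:
    """Return the longest run of consecutive months in which every author appears.

    Expects df to have an 'author' column and a 'period' column of monthly pandas Periods.
    """
    # Months in which every tracked author contributed at least once
    counts = df.groupby("period")["author"].nunique()
    complete = sorted(p for p, n in counts.items() if n == len(authors))

    best, current = [], []
    for period in complete:
        if current and period.ordinal - current[-1].ordinal == 1:
            current.append(period)   # extends the current consecutive run
        else:
            current = [period]       # start a new run
        if len(current) > len(best):
            best = current
    return best
```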

Data Artifacts

The system generates multiple intermediate and final data artifacts at each stage:

Embedding Artifacts

| Artifact | Dimensions | Description |
|---|---|---|
| new_code_emb_{repo}.npy | N × 768 | Embeddings of new code versions |
| old_code_emb_{repo}.npy | N × 768 | Embeddings of old code versions |
| difference_code_emb_{repo}.npy | N × 768 | Difference embeddings (new - old) |
| pr_emb_{repo}.npy | M × 768 | PR-level aggregated embeddings |
| author_time_emb_{repo}.npy | K × 768 | Author-time aggregated embeddings |

Sources: get_embedding.py L32-L34, [get_average_time.py L40](https://github.com/hreyulog/github_opinion_dynamics/blob/6e264d44/get_average_time.py#L40-L40)

Output Artifacts

| Artifact | Format | Description |
|---|---|---|
| opinion_{repo}_all3.png | PNG | Time series plot of author opinions |
| author_time_{repo}.csv | CSV | Author-time embeddings with 1D PCA scores |
| {repo}_for_mathematica.csv | CSV | Pivoted time series for external analysis |
| pr_time_{repo}.csv | CSV | PR metadata with timestamps and authors |

Sources: get_average_time.py L62-L117

Execution Model

The system operates in batch mode, processing multiple repositories sequentially:

```mermaid
flowchart TD
    START["Script Execution"]
    REPOS["repo_names = ['swift', 'pytorch', 'ceph']"]
    LOOP["For each repo_name in repo_names"]
    MAIN["main(repo_name)"]
    END["All repos processed"]

    START --> REPOS
    REPOS --> LOOP
    LOOP --> MAIN
    MAIN --> LOOP
    LOOP --> END
```

Example: get_average_time.py L119-L124 shows the standard batch execution pattern used across all main scripts.
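
A minimal sketch of that pattern, with a placeholder main() standing in for whichever pipeline script is being run:

```python
repo_names = ["swift", "pytorch", "ceph"]

def main(repo_name: str) -> None:
    # Placeholder body: each pipeline script implements its own stage here
    print(f"Processing {repo_name}")

if __name__ == "__main__":
    for repo_name in repo_names:
        main(repo_name)
```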

Sources: get_average_time.py L119-L124, get_embedding.py L37-L40, get_old_new_snippet.py L86-L89

System Dependencies

External Services

  • PyTorch Serving: HTTP endpoint at localhost:8080 serving the emb_comp model
  • Network File System: Data storage at /srv/nfs/VESO/... (referenced in diagrams)

Python Libraries

  • Data Processing: pandas, numpy
  • ML/Dimensionality Reduction: sklearn (PCA, StandardScaler, QuantileTransformer, MinMaxScaler)
  • Deep Learning: torch (for tensor operations)
  • HTTP Communication: requests (for PyTorch Serving API calls)
  • Visualization: matplotlib, seaborn, plotly (for plots and graphs)
  • Network Analysis: networkx (for graph operations)

Sources: get_average_time.py L1-L12, get_embedding.py L1-L5

Target Repositories

The system is configured to analyze three large-scale open source projects:

| Repository | Language | Notable Characteristics |
|---|---|---|
| swift | Swift | Apple's programming language |
| pytorch | Python/C++ | Deep learning framework |
| ceph | C++ | Distributed storage system |

These repositories were selected for their high activity levels and diverse contributor bases.

Sources: [get_average_time.py L120](https://github.com/hreyulog/github_opinion_dynamics/blob/6e264d44/get_average_time.py#L120-L120), [get_embedding.py L38](https://github.com/hreyulog/github_opinion_dynamics/blob/6e264d44/get_embedding.py#L38-L38), [get_old_new_snippet.py L87](https://github.com/hreyulog/github_opinion_dynamics/blob/6e264d44/get_old_new_snippet.py#L87-L87)

System Workflow Summary

The complete execution workflow follows this sequence:

  1. Data Collection → Filter repositories and identify key contributors
  2. Code Extraction → Parse git patches into old/new code pairs
  3. Embedding Generation → Query PyTorch model to get 768D vectors for code
  4. Compute Differences → Calculate new_code_emb - old_code_emb
  5. PR Aggregation → Average difference embeddings by pull request
  6. Author-Time Aggregation → Average PR embeddings by author and time period (semi-annual)
  7. PCA Reduction → Apply 4-stage transformation pipeline to reduce to 1D
  8. Time Filtering → Select longest consecutive period with complete author coverage
  9. Visualization → Generate time-series plots showing opinion evolution

Each script in the pipeline writes intermediate results to disk, enabling debugging and pipeline restart from any stage.
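
As a concrete illustration of step 4, the difference embeddings can be computed directly from the saved arrays. File names follow the artifact table above; the choice of 'swift' as the example repository and the (N, 768) array shapes are assumptions for this sketch.

```python
import numpy as np

repo = "swift"  # any of the configured repositories

new_code_emb = np.load(f"new_code_emb_{repo}.npy")   # assumed shape (N, 768)
old_code_emb = np.load(f"old_code_emb_{repo}.npy")   # assumed shape (N, 768)

# Each row is the semantic shift introduced by one code change
difference_code_emb = new_code_emb - old_code_emb
np.save(f"difference_code_emb_{repo}.npy", difference_code_emb)
```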

Sources: get_embedding.py L24-L40, get_average_time.py L28-L118, get_old_new_snippet.py L16-L85

Citation

If you use this repository in your research, please cite the following article:

@article{HE2026102824,
title = {Social life of code: Modeling evolution through code embedding and opinion dynamics},
journal = {Journal of Computational Science},
volume = {96},
pages = {102824},
year = {2026},
issn = {1877-7503},
doi = {https://doi.org/10.1016/j.jocs.2026.102824},
url = {https://www.sciencedirect.com/science/article/pii/S1877750326000426},
author = {Yulong He and Nikita Verbin and Sergey Kovalchuk},
keywords = {Opinion dynamic, NLP, Human behavior analysis, Codebase evolution, Social-technical analysis},
abstract = {Software repositories capture rich traces of collaborative software development, yet extracting interpretable insights about how developer interactions shape codebase evolution remains challenging. In this work, we present a novel analytical framework that combines semantic representations of code changes with opinion dynamics theory to reveal latent collaboration patterns in software projects. Rather than focusing solely on code artifacts, our approach characterizes how developers influence one another over time and how consensus or divergence emerges during the evolution of a codebase. Applying this framework to multiple large-scale open-source GitHub repositories, we uncover clear and interpretable behavioral trends, including the formation of stable consensus, the presence of influential developers who shape project direction, and periods of fragmentation corresponding to major development shifts. These dynamics are shown to be consistent across projects while also reflecting repository-specific collaboration styles and governance structures. Our results demonstrate that modeling software evolution through the lens of opinion dynamics provides actionable insights into developer influence, knowledge sharing, and long-term project sustainability. By bridging software engineering with computational social science, this work offers a new perspective on understanding and improving collaborative software development in open-source ecosystems.}
}
