(Generated by DeepWiki.) The dataset is available on Hugging Face: https://huggingface.co/datasets/hreyulog/GitHub-3Repo-7User-Opinion-Dynamics
The GitHub Opinion Dynamics system is a research pipeline designed to analyze how developer "opinions" evolve over time within software development communities. The system treats code changes as expressions of opinion, using machine learning embeddings to quantify semantic differences between old and new code versions. By aggregating and analyzing these embeddings across authors and time periods, the system produces time-series visualizations showing how different contributors' coding approaches converge or diverge.
This document provides a high-level introduction to the system architecture, data flow, and key components. For detailed information about specific subsystems:
- Data collection and filtering: see Data Collection Pipeline
- Embedding generation process: see Embedding Generation System
- PCA and dimensionality reduction: see Opinion Analysis and Dimensionality Reduction
- Visualization outputs: see Visualization and Output
- Network analysis: see Network Analysis
The system follows a multi-stage pipeline architecture, processing GitHub repository data through several transformation layers:
```mermaid
flowchart TD
    subgraph Data_Collection_Layer ["Data Collection Layer"]
        MAX["get_max_repo.py<br/>Repository Selection"]
        ONE["get_one_repository.py<br/>Single Repo Filtering"]
        USERS["get_7_users_pr.py<br/>Key User Selection"]
    end
    subgraph Code_Extraction_Layer ["Code Extraction Layer"]
        SNIPPET["get_old_new_snippet.py<br/>Patch Parser"]
        OLD["old_code extraction"]
        NEW["new_code extraction"]
    end
    subgraph Embedding_Generation_Layer ["Embedding Generation Layer"]
        EMB["get_embedding.py<br/>get_pytorch_serve_embeddings()"]
        SERVE["PyTorch Serving<br/>localhost:8080/predictions/emb_comp"]
        DIFF["difference_emb = new_code_emb - old_code_emb"]
    end
    subgraph Aggregation_Layer ["Aggregation Layer"]
        PR["get_average_pr.py<br/>PR-Level Aggregation"]
        TIME["get_average_time.py<br/>compute_mean_emb()"]
    end
    subgraph Dimensionality_Reduction ["Dimensionality Reduction"]
        PCA["PCA Pipeline<br/>StandardScaler → PCA → QuantileTransformer → MinMaxScaler"]
    end
    subgraph Visualization_Layer ["Visualization Layer"]
        DRAW["opinion_draw.py<br/>Time Series Plots"]
        OUT["opinion_{repo}_all3.png"]
    end
    subgraph Network_Analysis_Branch ["Network Analysis Branch"]
        LINK["get_link_users.py<br/>Interaction Matrix"]
        NET["network_comment_issue.py<br/>Interactive Graphs"]
        UTIME["user_issue_time.py<br/>Temporal Activity"]
    end
    MAX --> ONE
    ONE --> USERS
    ONE --> SNIPPET
    SNIPPET --> OLD
    SNIPPET --> NEW
    OLD --> EMB
    NEW --> EMB
    EMB --> SERVE
    EMB --> DIFF
    DIFF --> PR
    PR --> TIME
    TIME --> PCA
    PCA --> DRAW
    DRAW --> OUT
    ONE --> LINK
    ONE --> NET
    ONE --> UTIME
```
Sources: get_embedding.py L1-L44
The system processes data through seven distinct stages:
| Stage | Input | Process | Output | Key Script |
|---|---|---|---|---|
| 1. Repository Selection | GitHub PR data | Filter by issue count | `filtered_pr_{repo}.csv` | `get_max_repo.py` |
| 2. User Identification | PR metadata | Identify top 7 contributors | User list | `get_7_users_pr.py` |
| 3. Code Extraction | Git patches | Parse diffs into old/new code | `{repo}_diff.csv` | `get_old_new_snippet.py` |
| 4. Embedding Generation | Code snippets | Query PyTorch model | `difference_code_emb_{repo}.npy` | `get_embedding.py` |
| 5. PR Aggregation | Difference embeddings | Mean by PR | `pr_emb_{repo}.npy` | `get_average_pr.py` |
| 6. Author-Time Aggregation | PR embeddings | Mean by author + time | `author_time_emb_{repo}.npy` | `get_average_time.py` |
| 7. Dimensionality Reduction | 768D embeddings | PCA to 1D | `author_time_{repo}.csv` | `get_average_time.py` |
Sources: get_embedding.py L6-L44
get_old_new_snippet.py L16-L89
The system relies on an external PyTorch model serving endpoint at http://localhost:8080/predictions/emb_comp that converts code strings into 768-dimensional embedding vectors. The get_pytorch_serve_embeddings() function handles communication with this service.
Key Functions:
- `get_pytorch_serve_embeddings()` (get_embedding.py L6-L22): sends POST requests to `localhost:8080/predictions/emb_comp` and returns embeddings as `torch.Tensor` objects
Sources: get_embedding.py L6-L26
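As a rough illustration, a client for such an endpoint could look like the sketch below. The request payload shape (`{"data": code}`) and the response format (a JSON list of floats) are assumptions for this sketch, not taken from get_embedding.py itself.

```python
import requests
import torch

def get_pytorch_serve_embeddings(code, url="http://localhost:8080/predictions/emb_comp"):
    """Send a code string to the TorchServe endpoint and return its embedding.

    The payload shape ({"data": code}) is an assumption; the actual
    get_embedding.py may format the request differently.
    """
    response = requests.post(url, json={"data": code})
    response.raise_for_status()
    # Assume the service returns a JSON list of 768 floats
    return torch.tensor(response.json())

# The difference embedding is then simply:
# difference_emb = get_pytorch_serve_embeddings(new_code) - get_pytorch_serve_embeddings(old_code)
```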
The get_old_new_snippet.py module parses git diff patches to extract the before/after code states for each file change. It processes patches line-by-line, identifying added lines (+), removed lines (-), and unchanged context lines.
Key Processing Steps:
- Filter files by extension (exclude `.md`, include code files with extensions)
- Parse diff lines starting with `-`, `+`, or context markers
- Remove `@@` hunk headers using regex
- Normalize whitespace to create single-line code strings
Sources: get_old_new_snippet.py L16-L85
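The line-by-line handling described above can be sketched as follows. This is a simplified reconstruction; the function name `split_patch` is illustrative rather than taken from get_old_new_snippet.py, which adds file filtering and edge-case handling.

```python
import re

def split_patch(patch):
    """Split a unified-diff patch into old and new code strings.

    Simplified illustration of the parsing described above; the real
    get_old_new_snippet.py has additional filtering and edge cases.
    """
    old_lines, new_lines = [], []
    for line in patch.splitlines():
        if line.startswith("@@"):       # drop hunk headers like @@ -1,4 +1,5 @@
            continue
        if line.startswith("-"):        # removed line: old version only
            old_lines.append(line[1:])
        elif line.startswith("+"):      # added line: new version only
            new_lines.append(line[1:])
        else:                           # context line: present in both versions
            old_lines.append(line.lstrip(" "))
            new_lines.append(line.lstrip(" "))
    # Normalize whitespace into single-line code strings
    old_code = re.sub(r"\s+", " ", " ".join(old_lines)).strip()
    new_code = re.sub(r"\s+", " ", " ".join(new_lines)).strip()
    return old_code, new_code
```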
Embeddings are aggregated in two stages using the compute_mean_emb() function:
```mermaid
flowchart TD
    DIFF["difference_code_emb_{repo}.npy<br/>(Code-level embeddings)"]
    PR_EMB["pr_emb_{repo}.npy<br/>(PR-level embeddings)"]
    AUTH_EMB["author_time_emb_{repo}.npy<br/>(Author-time embeddings)"]
    DIFF --> PR_EMB
    PR_EMB --> AUTH_EMB
```
The `compute_mean_emb()` function (get_average_time.py L17-L22) performs array stacking and mean computation.
Sources: get_average_time.py L17-L40
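A minimal sketch of what such a stack-and-mean helper might look like (the actual `compute_mean_emb()` in get_average_time.py may differ in signature and detail):

```python
import numpy as np

def compute_mean_emb(emb_list):
    """Stack a list of 768-dimensional vectors and return their mean.

    Sketch of the aggregation step; the actual compute_mean_emb() in
    get_average_time.py may differ.
    """
    stacked = np.stack(emb_list)   # shape: (n, 768)
    return stacked.mean(axis=0)    # shape: (768,)

# Stage 1: mean over all code-level difference embeddings within one PR
# Stage 2: mean over all PR-level embeddings for one author in one time period
```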
The dimensionality reduction pipeline in get_average_time.py applies four sequential transformations to convert 768D embeddings to interpretable 1D opinion scores:
```mermaid
flowchart TD
    INPUT["author_time_emb_{repo}.npy<br/>(768 dimensions)"]
    STAGE1["StandardScaler<br/>Normalize to μ=0, σ=1"]
    STAGE2["PCA(n_components=1)<br/>Extract principal component"]
    STAGE3["QuantileTransformer<br/>output_distribution='normal'"]
    STAGE4["MinMaxScaler<br/>Scale to [0,1]"]
    OUTPUT["author_time_{repo}.csv<br/>(1 dimension)"]
    INPUT --> STAGE1
    STAGE1 --> STAGE2
    STAGE2 --> STAGE3
    STAGE3 --> STAGE4
    STAGE4 --> OUTPUT
```
Implementation: get_average_time.py L46-L58
Sources: get_average_time.py L42-L62
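The four stages can be reproduced with a standard scikit-learn pipeline. This is a reconstruction from the diagram above; parameters not shown there (such as `n_quantiles`) are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, QuantileTransformer, MinMaxScaler

# Stand-in data for author_time_emb_{repo}.npy (60 author-time rows, 768 dims)
rng = np.random.default_rng(0)
emb = rng.normal(size=(60, 768))

pipeline = make_pipeline(
    StandardScaler(),                     # normalize each dimension to mean 0, std 1
    PCA(n_components=1),                  # project onto the first principal component
    QuantileTransformer(
        output_distribution="normal",
        n_quantiles=60,                   # assumption; must not exceed n_samples
    ),
    MinMaxScaler(),                       # rescale opinion scores into [0, 1]
)
opinion_scores = pipeline.fit_transform(emb)  # shape: (60, 1)
```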
The system identifies consecutive time periods where all authors have data, selecting the longest such period for visualization. The algorithm (get_average_time.py L71-L96) groups consecutive months with complete author coverage.
Key Logic:
- Sort time periods chronologically
- Group consecutive months where all authors contributed
- Select longest consecutive group
- Apply additional 4-month offset and 12-month window
Sources: get_average_time.py L68-L96
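The grouping logic can be sketched as follows. The function name and data layout are hypothetical, chosen only to illustrate selecting the longest complete run of months; the 4-month offset and 12-month window are omitted.

```python
def longest_complete_window(author_months):
    """Return the longest run of consecutive months covered by every author.

    author_months maps author -> set of (year, month) tuples. Hypothetical
    helper, not the actual implementation in get_average_time.py.
    """
    # Keep only months in which *all* authors contributed
    complete = sorted(set.intersection(*author_months.values()))
    if not complete:
        return []
    def linear(ym):
        # Index months linearly so consecutive months differ by exactly 1
        return ym[0] * 12 + ym[1]
    runs, current = [], [complete[0]]
    for prev, cur in zip(complete, complete[1:]):
        if linear(cur) - linear(prev) == 1:
            current.append(cur)
        else:
            runs.append(current)
            current = [cur]
    runs.append(current)
    return max(runs, key=len)
```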
The system generates multiple intermediate and final data artifacts at each stage:
| Artifact | Dimensions | Description |
|---|---|---|
| `new_code_emb_{repo}.npy` | N × 768 | Embeddings of new code versions |
| `old_code_emb_{repo}.npy` | N × 768 | Embeddings of old code versions |
| `difference_code_emb_{repo}.npy` | N × 768 | Difference embeddings (new - old) |
| `pr_emb_{repo}.npy` | M × 768 | PR-level aggregated embeddings |
| `author_time_emb_{repo}.npy` | K × 768 | Author-time aggregated embeddings |
Sources: get_embedding.py L32-L34, [get_average_time.py L40](https://github.com/hreyulog/github_opinion_dynamics/blob/6e264d44/get_average_time.py#L40-L40)
| Artifact | Format | Description |
|---|---|---|
| `opinion_{repo}_all3.png` | PNG | Time series plot of author opinions |
| `author_time_{repo}.csv` | CSV | Author-time embeddings with 1D PCA scores |
| `{repo}_for_mathematica.csv` | CSV | Pivoted time series for external analysis |
| `pr_time_{repo}.csv` | CSV | PR metadata with timestamps and authors |
Sources: get_average_time.py L62-L117
The system operates in batch mode, processing multiple repositories sequentially:
```mermaid
flowchart TD
    START["Script Execution"]
    REPOS["repo_names = ['swift', 'pytorch', 'ceph']"]
    LOOP["For each repo_name in repo_names"]
    MAIN["main(repo_name)"]
    END["All repos processed"]
    START --> REPOS
    REPOS --> LOOP
    LOOP --> MAIN
    MAIN --> LOOP
    LOOP --> END
```
Example: get_average_time.py L119-L124 shows the standard batch execution pattern used across all main scripts.
Sources: get_average_time.py L119-L124
get_old_new_snippet.py L86-L89
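The pattern amounts to a simple loop over repository names; `main()` below is a placeholder for each script's per-repository entry point.

```python
# Sketch of the batch execution pattern shared by the main scripts;
# main() stands in for each script's per-repository processing.
repo_names = ['swift', 'pytorch', 'ceph']
processed = []

def main(repo_name):
    # The real scripts run one full pipeline stage for this repository here
    processed.append(repo_name)

for repo_name in repo_names:
    main(repo_name)
```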
- PyTorch Serving: HTTP endpoint at `localhost:8080` serving the `emb_comp` model
- Network File System: Data storage at `/srv/nfs/VESO/...` (referenced in diagrams)

- Data Processing: `pandas`, `numpy`
- ML/Dimensionality Reduction: `sklearn` (PCA, StandardScaler, QuantileTransformer, MinMaxScaler)
- Deep Learning: `torch` (for tensor operations)
- HTTP Communication: `requests` (for PyTorch Serving API calls)
- Visualization: `matplotlib`, `seaborn`, `plotly` (for plots and graphs)
- Network Analysis: `networkx` (for graph operations)
Sources: get_average_time.py L1-L12
The system is configured to analyze three large-scale open source projects:
| Repository | Language | Notable Characteristics |
|---|---|---|
| swift | Swift | Apple's programming language |
| pytorch | Python/C++ | Deep learning framework |
| ceph | C++ | Distributed storage system |
These repositories were selected for their high activity levels and diverse contributor bases.
Sources: [get_average_time.py L120](https://github.com/hreyulog/github_opinion_dynamics/blob/6e264d44/get_average_time.py#L120-L120), [get_embedding.py L38](https://github.com/hreyulog/github_opinion_dynamics/blob/6e264d44/get_embedding.py#L38-L38), [get_old_new_snippet.py L87](https://github.com/hreyulog/github_opinion_dynamics/blob/6e264d44/get_old_new_snippet.py#L87-L87)
The complete execution workflow follows this sequence:
1. Data Collection → Filter repositories and identify key contributors
2. Code Extraction → Parse git patches into old/new code pairs
3. Embedding Generation → Query the PyTorch model to get 768D vectors for code
4. Compute Differences → Calculate `new_code_emb - old_code_emb`
5. PR Aggregation → Average difference embeddings by pull request
6. Author-Time Aggregation → Average PR embeddings by author and time period (semi-annual)
7. PCA Reduction → Apply the 4-stage transformation pipeline to reduce to 1D
8. Time Filtering → Select the longest consecutive period with complete author coverage
9. Visualization → Generate time-series plots showing opinion evolution
Each script in the pipeline writes intermediate results to disk, enabling debugging and pipeline restart from any stage.
Sources: get_embedding.py L24-L40
get_old_new_snippet.py L16-L85
If you use this repository in your research, please cite the following article:
```bibtex
@article{HE2026102824,
  title = {Social life of code: Modeling evolution through code embedding and opinion dynamics},
  journal = {Journal of Computational Science},
  volume = {96},
  pages = {102824},
  year = {2026},
  issn = {1877-7503},
  doi = {https://doi.org/10.1016/j.jocs.2026.102824},
  url = {https://www.sciencedirect.com/science/article/pii/S1877750326000426},
  author = {Yulong He and Nikita Verbin and Sergey Kovalchuk},
  keywords = {Opinion dynamic, NLP, Human behavior analysis, Codebase evolution, Social-technical analysis},
  abstract = {Software repositories capture rich traces of collaborative software development, yet extracting interpretable insights about how developer interactions shape codebase evolution remains challenging. In this work, we present a novel analytical framework that combines semantic representations of code changes with opinion dynamics theory to reveal latent collaboration patterns in software projects. Rather than focusing solely on code artifacts, our approach characterizes how developers influence one another over time and how consensus or divergence emerges during the evolution of a codebase. Applying this framework to multiple large-scale open-source GitHub repositories, we uncover clear and interpretable behavioral trends, including the formation of stable consensus, the presence of influential developers who shape project direction, and periods of fragmentation corresponding to major development shifts. These dynamics are shown to be consistent across projects while also reflecting repository-specific collaboration styles and governance structures. Our results demonstrate that modeling software evolution through the lens of opinion dynamics provides actionable insights into developer influence, knowledge sharing, and long-term project sustainability. By bridging software engineering with computational social science, this work offers a new perspective on understanding and improving collaborative software development in open-source ecosystems.}
}
```