hreyulog/github_opinion_dynamics

Overview

(Generated by DeepWiki.) The dataset is available on Hugging Face: https://huggingface.co/datasets/hreyulog/GitHub-3Repo-7User-Opinion-Dynamics

Purpose and Scope

The GitHub Opinion Dynamics system is a research pipeline designed to analyze how developer "opinions" evolve over time within software development communities. The system treats code changes as expressions of opinion, using machine learning embeddings to quantify semantic differences between old and new code versions. By aggregating and analyzing these embeddings across authors and time periods, the system produces time-series visualizations showing how different contributors' coding approaches converge or diverge.

This document provides a high-level introduction to the system architecture, data flow, and key components; the sections below describe the individual subsystems in more detail.

System Architecture

The system follows a multi-stage pipeline architecture, processing GitHub repository data through several transformation layers:

High-Level Architecture Diagram

```mermaid
flowchart TD
    subgraph Data_Collection_Layer ["Data Collection Layer"]
        MAX["get_max_repo.py<br>Repository Selection"]
        ONE["get_one_repository.py<br>Single Repo Filtering"]
        USERS["get_7_users_pr.py<br>Key User Selection"]
    end

    subgraph Code_Extraction_Layer ["Code Extraction Layer"]
        SNIPPET["get_old_new_snippet.py<br>Patch Parser"]
        OLD["old_code extraction"]
        NEW["new_code extraction"]
    end

    subgraph Embedding_Generation_Layer ["Embedding Generation Layer"]
        EMB["get_embedding.py<br>get_pytorch_serve_embeddings()"]
        SERVE["PyTorch Serving<br>localhost:8080/predictions/emb_comp"]
        DIFF["difference_emb = new_code_emb - old_code_emb"]
    end

    subgraph Aggregation_Layer ["Aggregation Layer"]
        PR["get_average_pr.py<br>PR-Level Aggregation"]
        TIME["get_average_time.py<br>compute_mean_emb()"]
    end

    subgraph Dimensionality_Reduction ["Dimensionality Reduction"]
        PCA["PCA Pipeline<br>StandardScaler → PCA → QuantileTransformer → MinMaxScaler"]
    end

    subgraph Visualization_Layer ["Visualization Layer"]
        DRAW["opinion_draw.py<br>Time Series Plots"]
        OUT["opinion_{repo}_all3.png"]
    end

    subgraph Network_Analysis_Branch ["Network Analysis Branch"]
        LINK["get_link_users.py<br>Interaction Matrix"]
        NET["network_comment_issue.py<br>Interactive Graphs"]
        UTIME["user_issue_time.py<br>Temporal Activity"]
    end

    MAX --> ONE
    ONE --> USERS
    ONE --> SNIPPET
    SNIPPET --> OLD
    SNIPPET --> NEW
    OLD --> EMB
    NEW --> EMB
    EMB --> SERVE
    EMB --> DIFF
    DIFF --> PR
    PR --> TIME
    TIME --> PCA
    PCA --> DRAW
    DRAW --> OUT
    ONE --> LINK
    ONE --> NET
    ONE --> UTIME
```

Sources: get_embedding.py L1-L44, get_old_new_snippet.py L1-L89, get_average_time.py L1-L124

Core Data Flow

The system processes data through seven distinct stages:

| Stage | Input | Process | Output | Key Script |
|---|---|---|---|---|
| 1. Repository Selection | GitHub PR data | Filter by issue count | filtered_pr_{repo}.csv | get_max_repo.py |
| 2. User Identification | PR metadata | Identify top 7 contributors | User list | get_7_users_pr.py |
| 3. Code Extraction | Git patches | Parse diffs into old/new code | {repo}_diff.csv | get_old_new_snippet.py |
| 4. Embedding Generation | Code snippets | Query PyTorch model | difference_code_emb_{repo}.npy | get_embedding.py |
| 5. PR Aggregation | Difference embeddings | Mean by PR | pr_emb_{repo}.npy | get_average_pr.py |
| 6. Author-Time Aggregation | PR embeddings | Mean by author + time | author_time_emb_{repo}.npy | get_average_time.py |
| 7. Dimensionality Reduction | 768D embeddings | PCA to 1D | author_time_{repo}.csv | get_average_time.py |

Sources: get_embedding.py L6-L44, get_old_new_snippet.py L16-L89, get_average_time.py L28-L118

Key Components

1. PyTorch Serving Interface

The system relies on an external PyTorch model serving endpoint at http://localhost:8080/predictions/emb_comp that converts code strings into 768-dimensional embedding vectors. The get_pytorch_serve_embeddings() function handles communication with this service.

Key Functions:

  • get_pytorch_serve_embeddings() in get_embedding.py L6-L22
  • POST requests to localhost:8080/predictions/emb_comp
  • Returns embeddings as torch.Tensor objects

Sources: get_embedding.py L6-L26
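
Below is a minimal sketch of how a client might call this endpoint. Only the URL and the function name get_pytorch_serve_embeddings() appear in the source; the request payload format, the JSON response shape, and the one-snippet-per-request batching are assumptions made for illustration.

```python
import requests
import torch

SERVE_URL = "http://localhost:8080/predictions/emb_comp"

def get_pytorch_serve_embeddings(code_snippets):
    """Send code strings to the serving endpoint and return an (N, 768) tensor."""
    embeddings = []
    for snippet in code_snippets:
        # Assumed payload format: the raw code string is posted as the request body.
        response = requests.post(SERVE_URL, data=snippet.encode("utf-8"))
        response.raise_for_status()
        # Assumed response format: a JSON list of 768 floats per snippet.
        embeddings.append(torch.tensor(response.json()))
    return torch.stack(embeddings)
```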

2. Patch Parser

The get_old_new_snippet.py module parses git diff patches to extract the before/after code states for each file change. It processes patches line-by-line, identifying added lines (+), removed lines (-), and unchanged context lines.

Key Processing Steps:

  • Filter files by extension (exclude .md, include code files with extensions)
  • Parse diff lines starting with -, +, or context markers
  • Remove @@ hunk headers using regex
  • Normalize whitespace to create single-line code strings

Sources: get_old_new_snippet.py L16-L85
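
The sketch below illustrates this parsing approach under stated assumptions: the function name is hypothetical, file headers and hunk headers are simply skipped (the source removes the @@ markers with a regex), and the edge cases handled in get_old_new_snippet.py may differ.

```python
def split_patch(patch: str) -> tuple[str, str]:
    """Split a unified-diff patch into single-line old and new code strings."""
    old_lines, new_lines = [], []
    for line in patch.splitlines():
        if line.startswith(("@@", "+++", "---")):
            # Hunk headers like "@@ -10,7 +10,8 @@" and file headers carry no code
            continue
        if line.startswith("-"):
            old_lines.append(line[1:])
        elif line.startswith("+"):
            new_lines.append(line[1:])
        else:
            # Context lines belong to both the old and the new version
            old_lines.append(line)
            new_lines.append(line)
    # Normalize whitespace so each version becomes a single-line string
    old_code = " ".join(" ".join(old_lines).split())
    new_code = " ".join(" ".join(new_lines).split())
    return old_code, new_code
```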

3. Embedding Aggregation Pipeline

Embeddings are aggregated in two stages using the compute_mean_emb() function:

flowchart TD

DIFF["difference_code_emb_{repo}.npy(Code-level embeddings)"]
PR_EMB["pr_emb_{repo}.npy(PR-level embeddings)"]
AUTH_EMB["author_time_emb_{repo}.npy(Author-time embeddings)"]
Loading

The compute_mean_emb() function (get_average_time.py L17-L22) performs array stacking and mean computation.
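
A minimal sketch of that stack-and-mean step is shown below, assuming the function receives a list of 768-dimensional NumPy arrays; the actual signature in get_average_time.py may differ.

```python
import numpy as np

def compute_mean_emb(embeddings: list[np.ndarray]) -> np.ndarray:
    """Stack a list of (768,) embedding vectors and return their element-wise mean."""
    stacked = np.stack(embeddings, axis=0)  # shape: (n, 768)
    return stacked.mean(axis=0)             # shape: (768,)
```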

Sources: get_average_time.py L17-L40, get_embedding.py L30-L34

4. Multi-Stage Transformation Pipeline

The dimensionality reduction pipeline in get_average_time.py applies four sequential transformations to convert 768D embeddings to interpretable 1D opinion scores:

```mermaid
flowchart TD
    INPUT["author_time_emb_{repo}.npy<br>(768 dimensions)"]
    STAGE1["StandardScaler<br>Normalize to μ=0, σ=1"]
    STAGE2["PCA(n_components=1)<br>Extract principal component"]
    STAGE3["QuantileTransformer<br>output_distribution='normal'"]
    STAGE4["MinMaxScaler<br>Scale to [0,1]"]
    OUTPUT["author_time_{repo}.csv<br>(1 dimension)"]

    INPUT --> STAGE1
    STAGE1 --> STAGE2
    STAGE2 --> STAGE3
    STAGE3 --> STAGE4
    STAGE4 --> OUTPUT
```

Implementation: get_average_time.py L46-L58

Sources: get_average_time.py L42-L62
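
The following sketch reproduces the four stages using the scikit-learn classes named in the diagram; apart from n_components=1 and output_distribution='normal', all parameters are library defaults assumed here rather than taken from the source.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer, StandardScaler

def reduce_to_opinion_scores(embeddings: np.ndarray) -> np.ndarray:
    """Map (K, 768) author-time embeddings to 1D opinion scores in [0, 1]."""
    x = StandardScaler().fit_transform(embeddings)                           # normalize to mean 0, std 1
    x = PCA(n_components=1).fit_transform(x)                                 # first principal component
    x = QuantileTransformer(output_distribution="normal").fit_transform(x)   # Gaussianize the scores
    x = MinMaxScaler().fit_transform(x)                                      # rescale to [0, 1]
    return x.ravel()
```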

5. Time Series Filtering

The system identifies consecutive time periods in which all authors have data and selects the longest such period for visualization. The algorithm (get_average_time.py L71-L96) groups consecutive months with complete author coverage.

Key Logic:

  • Sort time periods chronologically
  • Group consecutive months where all authors contributed
  • Select longest consecutive group
  • Apply additional 4-month offset and 12-month window

Sources: get_average_time.py L68-L96
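
An illustrative sketch of the longest-complete-run selection follows, assuming a table with author and monthly-period columns; the actual implementation may be structured differently and additionally applies the 4-month offset and 12-month window noted above.

```python
import pandas as pd

def longest_complete_run(df: pd.DataFrame, authors: list[str]) -> list[pd.Period]:
    """Return the longest run of consecutive months in which every author appears.

    Expects df to have an 'author' column and a 'period' column of monthly pandas Periods.
    """
    # Months in which every tracked author contributed at least once
    counts = df.groupby("period")["author"].nunique()
    complete = sorted(p for p, n in counts.items() if n == len(authors))

    best, current = [], []
    for period in complete:
        if current and period.ordinal - current[-1].ordinal == 1:
            current.append(period)   # extends the current consecutive run
        else:
            current = [period]       # start a new run
        if len(current) > len(best):
            best = current
    return best
```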

Data Artifacts

The system generates multiple intermediate and final data artifacts at each stage:

Embedding Artifacts

| Artifact | Dimensions | Description |
|---|---|---|
| new_code_emb_{repo}.npy | N × 768 | Embeddings of new code versions |
| old_code_emb_{repo}.npy | N × 768 | Embeddings of old code versions |
| difference_code_emb_{repo}.npy | N × 768 | Difference embeddings (new - old) |
| pr_emb_{repo}.npy | M × 768 | PR-level aggregated embeddings |
| author_time_emb_{repo}.npy | K × 768 | Author-time aggregated embeddings |

Sources: get_embedding.py L32-L34, [get_average_time.py L40](https://github.com/hreyulog/github_opinion_dynamics/blob/6e264d44/get_average_time.py#L40-L40)

Output Artifacts

| Artifact | Format | Description |
|---|---|---|
| opinion_{repo}_all3.png | PNG | Time series plot of author opinions |
| author_time_{repo}.csv | CSV | Author-time embeddings with 1D PCA scores |
| {repo}_for_mathematica.csv | CSV | Pivoted time series for external analysis |
| pr_time_{repo}.csv | CSV | PR metadata with timestamps and authors |

Sources: get_average_time.py L62-L117

Execution Model

The system operates in batch mode, processing multiple repositories sequentially:

```mermaid
flowchart TD
    START["Script Execution"]
    REPOS["repo_names = ['swift', 'pytorch', 'ceph']"]
    LOOP["For each repo_name in repo_names"]
    MAIN["main(repo_name)"]
    END["All repos processed"]

    START --> REPOS
    REPOS --> LOOP
    LOOP --> MAIN
    MAIN --> LOOP
    LOOP --> END
```

Example: get_average_time.py L119-L124 shows the standard batch execution pattern used across all main scripts.
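
A minimal sketch of that pattern, with a placeholder main() standing in for whichever pipeline script is being run:

```python
repo_names = ["swift", "pytorch", "ceph"]

def main(repo_name: str) -> None:
    # Placeholder body: each pipeline script implements its own stage here
    print(f"Processing {repo_name}")

if __name__ == "__main__":
    for repo_name in repo_names:
        main(repo_name)
```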

Sources: get_average_time.py L119-L124, get_embedding.py L37-L40, get_old_new_snippet.py L86-L89

System Dependencies

External Services

  • PyTorch Serving: HTTP endpoint at localhost:8080 serving the emb_comp model
  • Network File System: Data storage at /srv/nfs/VESO/... (referenced in diagrams)

Python Libraries

  • Data Processing: pandas, numpy
  • ML/Dimensionality Reduction: sklearn (PCA, StandardScaler, QuantileTransformer, MinMaxScaler)
  • Deep Learning: torch (for tensor operations)
  • HTTP Communication: requests (for PyTorch Serving API calls)
  • Visualization: matplotlib, seaborn, plotly (for plots and graphs)
  • Network Analysis: networkx (for graph operations)

Sources: get_average_time.py L1-L12, get_embedding.py L1-L5

Target Repositories

The system is configured to analyze three large-scale open source projects:

| Repository | Language | Notable Characteristics |
|---|---|---|
| swift | Swift | Apple's programming language |
| pytorch | Python/C++ | Deep learning framework |
| ceph | C++ | Distributed storage system |

These repositories were selected for their high activity levels and diverse contributor bases.

Sources: [get_average_time.py L120](https://github.com/hreyulog/github_opinion_dynamics/blob/6e264d44/get_average_time.py#L120-L120), [get_embedding.py L38](https://github.com/hreyulog/github_opinion_dynamics/blob/6e264d44/get_embedding.py#L38-L38), [get_old_new_snippet.py L87](https://github.com/hreyulog/github_opinion_dynamics/blob/6e264d44/get_old_new_snippet.py#L87-L87)

System Workflow Summary

The complete execution workflow follows this sequence:

  1. Data Collection → Filter repositories and identify key contributors
  2. Code Extraction → Parse git patches into old/new code pairs
  3. Embedding Generation → Query PyTorch model to get 768D vectors for code
  4. Compute Differences → Calculate new_code_emb - old_code_emb
  5. PR Aggregation → Average difference embeddings by pull request
  6. Author-Time Aggregation → Average PR embeddings by author and time period (semi-annual)
  7. PCA Reduction → Apply 4-stage transformation pipeline to reduce to 1D
  8. Time Filtering → Select longest consecutive period with complete author coverage
  9. Visualization → Generate time-series plots showing opinion evolution

Each script in the pipeline writes intermediate results to disk, enabling debugging and pipeline restart from any stage.
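
As a concrete illustration of step 4, the difference embeddings can be computed directly from the saved arrays. File names follow the artifact table above; the choice of 'swift' as the example repository and the (N, 768) array shapes are assumptions for this sketch.

```python
import numpy as np

repo = "swift"  # any of the configured repositories

new_code_emb = np.load(f"new_code_emb_{repo}.npy")   # assumed shape (N, 768)
old_code_emb = np.load(f"old_code_emb_{repo}.npy")   # assumed shape (N, 768)

# Each row is the semantic shift introduced by one code change
difference_code_emb = new_code_emb - old_code_emb
np.save(f"difference_code_emb_{repo}.npy", difference_code_emb)
```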

Sources: get_embedding.py L24-L40, get_average_time.py L28-L118, get_old_new_snippet.py L16-L85

Citation

If you use this repository in your research, please cite the following article:

@article{HE2026102824,
title = {Social life of code: Modeling evolution through code embedding and opinion dynamics},
journal = {Journal of Computational Science},
volume = {96},
pages = {102824},
year = {2026},
issn = {1877-7503},
doi = {https://doi.org/10.1016/j.jocs.2026.102824},
url = {https://www.sciencedirect.com/science/article/pii/S1877750326000426},
author = {Yulong He and Nikita Verbin and Sergey Kovalchuk},
keywords = {Opinion dynamic, NLP, Human behavior analysis, Codebase evolution, Social-technical analysis},
abstract = {Software repositories capture rich traces of collaborative software development, yet extracting interpretable insights about how developer interactions shape codebase evolution remains challenging. In this work, we present a novel analytical framework that combines semantic representations of code changes with opinion dynamics theory to reveal latent collaboration patterns in software projects. Rather than focusing solely on code artifacts, our approach characterizes how developers influence one another over time and how consensus or divergence emerges during the evolution of a codebase. Applying this framework to multiple large-scale open-source GitHub repositories, we uncover clear and interpretable behavioral trends, including the formation of stable consensus, the presence of influential developers who shape project direction, and periods of fragmentation corresponding to major development shifts. These dynamics are shown to be consistent across projects while also reflecting repository-specific collaboration styles and governance structures. Our results demonstrate that modeling software evolution through the lens of opinion dynamics provides actionable insights into developer influence, knowledge sharing, and long-term project sustainability. By bridging software engineering with computational social science, this work offers a new perspective on understanding and improving collaborative software development in open-source ecosystems.}
}
