Amil Khanzada — Graduate Research in Career Outcomes & Development
This repository implements a reproducible quantitative supplement for the paper "From Volunteer to Vocation: The Career Impact of Skill and Network Development in a Global Tech Nonprofit." Using relative importance analysis (LMG decomposition), we decompose the career-outcome variance attributable to seven skill and network predictors across 78 Virufy volunteers. The full-sample model achieves R² = 0.575, with Leadership Skills (q3) emerging as the strongest predictor (17.2% contribution), followed by Communication Skills (q2) (16.1%) and Network Quality (q6) (15.7%). Subgroup analyses reveal role-specific and career-stage-specific patterns, with students showing stronger leadership effects (23.3%) than professionals (10.4%). SEM fit is mixed but generally acceptable (CFI = 0.996, TLI = 0.994, SRMR = 0.030, RMSEA = 0.083).
Keywords: Career Development · Volunteer Outcomes · Relative Importance Analysis · Psychometric Modeling · Tech Nonprofit
This supplement decomposes six months of Virufy volunteer survey data (April 2025 – September 2025) into three interpretable components:
┌────────────────────────────────────────────────────────────┐
│ CAREER OUTCOMES PREDICTION PIPELINE │
│ │
│ ┌────────────────────┐ ┌──────────────────────┐ │
│ │ Skill Predictors │ │ Network Predictors │ │
│ │ │ │ │ │
│ │ • q1: Technical │ │ • q5: Size │ │
│ │ • q2: Comm. │ │ • q6: Quality │ │
│ │ • q3: Leadership │ │ • q7: Access │ │
│ │ • q4: Time Mgmt │ │ │ │
│ └─────────┬──────────┘ └──────────┬───────────┘ │
│ │ │ │
│ └────────────┬───────────┘ │
│ │ │
│ Feature Engineering │
│ Scaling · Missingness Audit │
│ Complete-Case Deletion (n=78) │
│ │ │
│ ┌───────────────┴───────────────┐ │
│ │ OLS Regression (LMG) │ │
│ │ + Bootstrap Confidence │ │
│ │ + VIF Diagnostics │ │
│ │ + SEM Construct Validation │ │
│ └───────────────┬───────────────┘ │
│ │ │
│ ┌────────────────────┼────────────────────┐ │
│ │ │ │ │
│ ┌─▼──────┐ ┌──────────▼────────┐ ┌───────▼─┐ │
│ │ Full │ │ Subgroup Ranking │ │ SEM │ │
│ │ Rank │ │ (Role, Stage, Geo)│ │ Structure│ │
│ │ Order │ │ │ │ Validity │ │
│ └────────┘ └───────────────────┘ └─────────┘ │
│ │
│ ──► relative_importance_results.csv │
│ ──► subgroup_analysis_results.csv │
│ ──► sem_fit_indices.csv │
│ ──► paper_claim_check.csv │
└────────────────────────────────────────────────────────────┘
| Metric | Full Sample | Students | Professionals |
|---|---|---|---|
| n | 78 | 46 | 32 |
| R² (OLS) | 0.575 | 0.709 | 0.484 |
| Top Predictor | q3: Leadership (17.2%) | q3: Leadership (23.3%) | q6: Network Quality (21.5%) |
| #2 Predictor | q2: Communication (16.1%) | q4: Time Mgmt (17.1%) | q1: Technical (18.6%) |
| #3 Predictor | q6: Network Quality (15.7%) | q2: Communication (15.3%) | q2: Communication (15.5%) |
| SEM CFI | 0.996 | — | — |
| SEM RMSEA | 0.083 | — | — |
Role-Type Results:
| Predictor | Tech (n=51) | Non-Tech (n=27) |
|---|---|---|
| q1: Technical | 12.7% | 21.7% ⭐ |
| q2: Communication | 15.3% | 12.9% |
| q3: Leadership | 18.4% ⭐ | 13.6% |
| q6: Network Quality | 16.4% | 15.2% |
From output/participant_flow.csv:
- Input rows: 80
- Complete-case rows (q1–q11): 78
- Excluded (missing core items): 2
To reproduce all results from scratch on a clean machine:
git clone https://github.com/virufy/paper-career-supplement.git
cd paper-career-supplement
Rscript --vanilla install_dependencies.R
On Linux (Ubuntu/Debian), you may first need system tools:
sudo apt update && sudo apt install -y build-essential r-base-dev libcurl4-openssl-dev libxml2-dev libssl-dev
Rscript --vanilla run_analysis.R
This single script:
- Auto-detects data source: uses input/vector_survey_responses.csv if available (real data), otherwise uses example data for demonstration
- Executes the full 6-step pipeline: Data audit → Descriptive stats → LMG analysis → Subgroup analysis → SEM → Paper claim verification
- Generates 29 output files: CSV/HTML tables, PNG/SVG visualizations, and session metadata in the output/ directory
- Takes ~2–5 minutes depending on your machine (bootstrap iterations: 1,000)
Output files are written to output/ (git-ignored, generated freshly each run):
relative_importance_results.csv ← Main LMG rankings (Table 2)
subgroup_analysis_results.csv ← Stratified findings (role, stage, geography)
sem_fit_indices.csv ← SEM model validation
paper_claim_check.csv ← Automated paper reproducibility audit
correlation_heatmap.png ← Visual predictor correlations
relative_importance_barplot.png ← Visual LMG rankings
subgroup_top_predictors_comparison.png ← Subgroup comparison
[11 additional CSV audit files]
If you have collected your own survey data using the standardized instrument:
- Ensure your CSV has the same structure: ≥18 columns with Likert items in columns 8–18
- Save it as input/vector_survey_responses.csv
- Run: Rscript --vanilla run_analysis.R
- The script detects the file automatically and prints: "✓ Using real data: input/vector_survey_responses.csv"
| Issue | Solution |
|---|---|
| "Missing package 'X'" | Run install_dependencies.R again or ensure internet connectivity |
| "Data file not found" | Verify your CSV is at input/vector_survey_responses.csv (or use example) |
| Permission errors on Linux | Try: chmod +x *.R && Rscript --vanilla install_dependencies.R |
| Very slow on large N | Edit run_analysis.R line ~220: change R = 1000 to R = 500 for bootstrap iterations |
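The last row above trades bootstrap precision for speed. As a rough illustration of what the iteration count controls, the base-R sketch below resamples rows with replacement and recomputes a statistic on each resample. It is a simplification under stated assumptions: it bootstraps the model R² with a percentile interval, whereas the pipeline is described as computing BCa intervals for LMG contributions with the boot package. The data frame `toy` and the helpers `r2_stat` and `boot_ci` are hypothetical names, not taken from run_analysis.R.

```r
# Simulated stand-in for the survey data (NOT the real responses).
set.seed(123)
n <- 78
toy <- data.frame(q2 = rnorm(n), q3 = rnorm(n))
toy$q10 <- 0.5 * toy$q3 + 0.3 * toy$q2 + rnorm(n)

# Statistic recomputed on every resample: R^2 of a small OLS model.
r2_stat <- function(d) summary(lm(q10 ~ q2 + q3, data = d))$r.squared

# Percentile bootstrap interval with B resamples (hypothetical helper).
boot_ci <- function(data, B, level = 0.95) {
  stats <- replicate(B, r2_stat(data[sample(nrow(data), replace = TRUE), ]))
  alpha <- (1 - level) / 2
  quantile(stats, c(alpha, 1 - alpha), names = FALSE)
}

ci_500  <- boot_ci(toy, B = 500)   # faster, slightly noisier interval endpoints
ci_1000 <- boot_ci(toy, B = 1000)  # iteration count stated in this README
print(rbind(B500 = ci_500, B1000 = ci_1000))
```

Halving the iteration count roughly halves the bootstrap runtime; the interval endpoints become somewhat less stable from run to run, but the point estimates are unaffected.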
File: input/vector_survey_responses.csv
| Field | Specification |
|---|---|
| Format | CSV, comma-separated |
| Encoding | UTF-8 |
| Required columns | Minimum 18 (see DATA_DICTIONARY.md) |
| Core Likert items | Columns 8–18 → mapped to q1–q11 |
| Missing data handling | Complete-case deletion: rows with any NA in q1–q11 excluded |
| Primary outcome | q10 (Job/Promotion Success) |
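The column mapping and complete-case rule in the table above can be sketched in a few lines of base R. This is a minimal illustration rather than the code in run_analysis.R: the helper name `prepare_survey` and the toy input are assumptions, while the column positions (8–18 mapped to q1–q11) and the deletion rule come from the specification.

```r
# Hypothetical helper: map columns 8-18 to q1..q11 and drop incomplete rows.
prepare_survey <- function(df) {
  stopifnot(ncol(df) >= 18)                      # spec: at least 18 columns
  items <- df[, 8:18, drop = FALSE]
  names(items) <- paste0("q", 1:11)
  # Coerce to numeric; non-numeric entries become NA and are then excluded.
  items[] <- lapply(items, function(x) suppressWarnings(as.numeric(as.character(x))))
  keep <- complete.cases(items)                  # TRUE only if q1-q11 are all present
  list(
    data = items[keep, , drop = FALSE],
    flow = c(input = nrow(df), complete = sum(keep), excluded = sum(!keep))
  )
}

# Toy input: 5 rows, 18 columns, one missing Likert response in column 9 (q2).
set.seed(1)
toy <- as.data.frame(matrix(sample(1:5, 5 * 18, replace = TRUE), nrow = 5))
toy[2, 9] <- NA
res <- prepare_survey(toy)
print(res$flow)
```

With real data, the input would come from `read.csv("input/vector_survey_responses.csv")` and the `flow` counts would correspond to the figures reported in output/participant_flow.csv.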
paper-vector-career/
├── README.md (this file)
├── DATA_DICTIONARY.md (variable mappings)
├── SUPPLEMENT.md (academic methods supplement)
├── VERIFICATION_REPORT.md (reproducibility audit)
├── install_dependencies.R (install R packages)
├── run_analysis.R (main analysis pipeline)
├── generate_figures.R (publication figures 1–4)
├── generate_tables.R (publication tables 1–5)
├── input/
│ └── vector_survey_responses.csv (survey data — not in git for privacy)
├── statistical_appendix/
│ ├── README.md
│ ├── reproduce_analysis.R
│ └── vector_survey_responses_example.csv (anonymised example dataset, N=30)
└── output/ (generated — not in git)
├── fig1_correlation_matrix.png
├── fig2_sem_path.png
├── fig3_lmg_forest.png
├── fig4_subgroup_comparison.png
├── table1_geography.html … table5_convergence.html
├── relative_importance_results.csv
├── subgroup_analysis_results.csv
├── sem_fit_indices.csv
├── paper_claim_check.csv
└── [additional CSV diagnostics]
- output/data_audit_summary.csv: input rows, complete rows, excluded rows
- output/participant_flow.csv: stage-by-stage participant counts
- output/core_item_missingness.csv: per-item missing data proportion
- output/relative_importance_results.csv: LMG rankings with 95% bootstrap CIs
- output/full_model_diagnostics.csv: R², VIF, Shapiro-Wilk p, Breusch-Pagan p
- output/correlation_matrix.csv: Spearman correlation heatmap data
- output/correlation_heatmap.png: visual correlation matrix
- output/relative_importance_barplot.png: LMG contribution chart
- output/subgroup_counts.csv: sample sizes per demographic group
- output/subgroup_analysis_results.csv: LMG rankings by role/stage/geography
- output/subgroup_top_predictors_comparison.png: visual comparison
- output/sem_fit_indices.csv: CFI, TLI, RMSEA, SRMR, composite correlations
- output/paper_claim_check.csv: automated claim-by-claim verification
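To make the diagnostics reported in output/full_model_diagnostics.csv more concrete, the sketch below computes two of them with base R only. It assumes simulated data and a hypothetical helper `vif_manual`; the pipeline itself is described as using the car and lmtest packages, and the Breusch-Pagan test is omitted here because it is not part of base R.

```r
# Simulated predictors with deliberate correlation between q1 and q2 (toy data).
set.seed(11)
n <- 78
q1 <- rnorm(n)
q2 <- 0.7 * q1 + rnorm(n)
q3 <- rnorm(n)
toy <- data.frame(q1, q2, q3, q10 = 0.4 * q1 + 0.3 * q2 + 0.3 * q3 + rnorm(n))

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on all remaining predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

preds <- c("q1", "q2", "q3")
vifs <- vif_manual(toy, preds)

# Shapiro-Wilk test on the residuals of the full model.
fit <- lm(q10 ~ q1 + q2 + q3, data = toy)
shapiro_p <- shapiro.test(residuals(fit))$p.value

print(round(vifs, 2))
print(shapiro_p)
```

A VIF close to 1 indicates a predictor that is nearly uncorrelated with the others; larger values signal overlapping information, which is the situation LMG is meant to handle when apportioning R².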
| Method | Package | Purpose |
|---|---|---|
| LMG Relative Importance | relaimpo | Decompose R² across predictors; handles multicollinearity |
| OLS Diagnostics | car, lmtest | VIF, Shapiro-Wilk normality, Breusch-Pagan heteroscedasticity |
| Bootstrap Confidence Intervals | boot | 1,000-iteration BCa CI for LMG contributions |
| Spearman Correlation | base R | Non-parametric association for ordinal Likert items |
| Structural Equation Modeling | lavaan (WLSMV) | Latent construct validation (Skill_Development, Networking, Career_Outcomes) |
| Subgroup Analysis | Custom | Split samples by role type, career stage, geography |
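The subgroup row in the table above amounts to refitting the same model within each stratum. The sketch below shows that split-and-refit pattern on simulated data; it reports only n and R² per group, whereas the pipeline is described as producing full LMG rankings per subgroup. The group sizes (Tech n=51, Non-Tech n=27) are taken from the role-type table in this README; the variable names and the simulated responses are assumptions.

```r
# Simulated data with the role-type group sizes reported in this README.
set.seed(7)
toy <- data.frame(
  role = rep(c("Tech", "Non-Tech"), times = c(51, 27)),
  q2 = rnorm(78),
  q3 = rnorm(78)
)
toy$q10 <- 0.4 * toy$q2 + 0.5 * toy$q3 + rnorm(78)

# Split by subgroup, refit the same OLS model in each, and collect a summary row.
by_role <- lapply(split(toy, toy$role), function(d) {
  fit <- lm(q10 ~ q2 + q3, data = d)
  data.frame(n = nrow(d), r_squared = summary(fit)$r.squared)
})
result <- do.call(rbind, by_role)
print(result)
```

Small strata give noisy estimates, so per-group rankings are best read alongside their sample sizes.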
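The Lindeman-Merenda-Gold (LMG) method decomposes the explained variance (R²) into per-predictor contributions by averaging each predictor's incremental R² over every order in which the predictors can enter the model. The resulting shares therefore do not depend on entry order and remain interpretable when predictors are correlated:

$$
\mathrm{LMG}(x_k) = \frac{1}{P!} \sum_{\pi} \left[ R^2\big(S_k(\pi) \cup \{x_k\}\big) - R^2\big(S_k(\pi)\big) \right]
$$

where P is the number of predictors and S_k(π) is the set of predictors that precede x_k in ordering π; the sum runs over all P! orderings, and the contributions add up to the full-model R².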
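For intuition, LMG can be computed by brute force in base R: enumerate every ordering of the predictors, record how much R² rises when each predictor enters, and average those gains. The sketch below does this on simulated data. It is an illustration only, with hypothetical helper names (`permutations`, `r_squared`, `lmg_shares`); the pipeline is described as using the relaimpo package, which is far more efficient for seven predictors.

```r
# All orderings of a vector (fine for small P; the count grows as P!).
permutations <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in permutations(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# R^2 of an OLS fit on a subset of predictors (defined as 0 for the empty set).
r_squared <- function(data, outcome, preds) {
  if (length(preds) == 0) return(0)
  summary(lm(reformulate(preds, response = outcome), data = data))$r.squared
}

# Average each predictor's incremental R^2 over all orderings.
lmg_shares <- function(data, outcome, predictors) {
  perms <- permutations(predictors)
  contrib <- setNames(numeric(length(predictors)), predictors)
  for (ord in perms) {
    for (j in seq_along(ord)) {
      before <- ord[seq_len(j - 1)]
      gain <- r_squared(data, outcome, c(before, ord[j])) -
        r_squared(data, outcome, before)
      contrib[ord[j]] <- contrib[ord[j]] + gain
    }
  }
  contrib / length(perms)
}

# Simulated data with correlated predictors x1 and x2 (toy values).
set.seed(42)
n <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3, y = 0.5 * x1 + 0.3 * x2 + 0.2 * x3 + rnorm(n))

shares  <- lmg_shares(toy, "y", c("x1", "x2", "x3"))
full_r2 <- r_squared(toy, "y", c("x1", "x2", "x3"))
print(round(shares, 3))
print(all.equal(sum(shares), full_r2))  # shares sum to the full-model R^2
```

Dividing `shares` by `full_r2` gives percentage contributions of the kind reported in the results tables.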
This repository adheres to the FAIR principles (Findable, Accessible, Interoperable, Reusable):
✓ Findable: versioned, publicly available, documented metadata
✓ Accessible: open-source code, example data included, dependency specifications
✓ Interoperable: standard CSV outputs, standard R ecosystem
✓ Reusable: standalone reproducible scripts
All numeric claims in the paper are automatically verified against code output in output/paper_claim_check.csv. See VERIFICATION_REPORT.md for full audit trail.
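The idea behind such a claim check can be sketched as a tolerance comparison between the value stated in the paper and the value recomputed by the code. The column names, the helper `check_claims`, and the numbers below are all hypothetical placeholders; the actual layout of output/paper_claim_check.csv is documented in VERIFICATION_REPORT.md, not here.

```r
# Hypothetical claim audit: a claim passes if the recomputed value matches the
# reported value to within half a unit of the last reported decimal place.
check_claims <- function(claims, tol = 0.0005) {
  claims$abs_diff <- abs(claims$claimed - claims$computed)
  claims$pass <- claims$abs_diff <= tol
  claims
}

# Placeholder values for illustration only (not results from the paper).
claims <- data.frame(
  claim    = c("example_metric_a", "example_metric_b"),
  claimed  = c(0.500, 0.100),
  computed = c(0.5003, 0.1200),
  stringsAsFactors = FALSE
)

audit <- check_claims(claims)
print(audit)
```

A failing row flags either a transcription error in the manuscript or a change in the data or code since the value was reported.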