<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="Tutorials and public datasets for language-related data science work.">
<title>Brendan Tomoschuk - Tutorials and Datasets</title>
<link rel="stylesheet" href="styles.css">
</head>
<body>
<header class="site-header site-header--home">
<div class="container">
<h1>Brendan Tomoschuk</h1>
<nav class="site-nav" aria-label="Main">
<a href="index.html">Home</a>
<a href="publications.html">Publications</a>
<a href="patents.html">Patents</a>
<a href="teaching.html">Teaching</a>
<a href="tutorials.html" class="active">Tutorials and Datasets</a>
<a href="downloads/index.html">Downloads</a>
</nav>
</div>
</header>
<main class="section">
<div class="container">
<h2>Tutorials and Datasets</h2>
<p>
Writeups that apply data science methods to language data and other public-interest
problems, plus links to the datasets those projects use and others I have collected.
</p>
<h3 class="content-section-heading">Tutorials</h3>
<ul class="project-list">
<li>
<div class="project-card-head">
<h3>Duolingo Language Difficulty Analysis</h3>
<span class="project-lang">Language: R</span>
</div>
<p>
Analysis of Duolingo learner performance to compare the relative difficulty of common
languages for English-speaking users. Uses multiple regression and linear mixed-effects models (LMER).
</p>
<p class="links">
<a href="website/duodata.html">Notebook</a>
|
<a href="https://github.com/tomoschuk/DuolingoData" target="_blank" rel="noopener noreferrer">GitHub</a>
</p>
</li>
<li>
<div class="project-card-head">
<h3>Reddit Sentiment Comparison</h3>
<span class="project-lang">Language: Python</span>
</div>
<p>
Sentiment comparison tool that quantifies how different Reddit communities react to the
same entertainment topic. Uses PRAW, pandas, and scikit-learn (TF-IDF, Naive Bayes
sentiment classification, LDA topic modeling), with seaborn and matplotlib for visualization.
</p>
<p class="links">
<a href="website/ironfist.html">Notebook</a>
|
<a href="https://github.com/tomoschuk/RedditComparisons" target="_blank" rel="noopener noreferrer">GitHub</a>
</p>
</li>
<li>
<div class="project-card-head">
<h3>Critical Role Transcript Explorer</h3>
<span class="project-lang">Language: Python</span>
</div>
<p>
Interactive dashboard for exploring language patterns in Critical Role transcripts,
with phrase-level and character-level comparisons. Uses pandas, Plotly, and Dash.
</p>
<p class="links">
<a href="website/critrole.html">Notebook</a>
|
<a href="https://github.com/tomoschuk/TranscriptExplorer" target="_blank" rel="noopener noreferrer">GitHub</a>
</p>
</li>
<li>
<div class="project-card-head">
<h3>CaseGuide (traffic court outcomes)</h3>
<span class="project-lang">Language: Python</span>
</div>
<p>
Analysis of civil-traffic court records from a Florida county, obtained through a FOIA
request, to explore how case context and demographics relate to outcomes when people
contest speeding tickets. Uses random forest and XGBoost multiclass models, probability
calibration, and Plotly visualizations.
</p>
<p class="links">
<a href="website/caseguide.html">Notebook</a>
</p>
</li>
</ul>
<h3 class="content-section-heading">Datasets</h3>
<p>
Public releases and sources referenced in the tutorials above, plus related (or just cool) datasets I've bookmarked.
</p>
<ul class="dataset-list">
<li>
<a href="https://github.com/duolingo/halflife-regression" target="_blank" rel="noopener noreferrer">Duolingo dataset (half-life regression)</a>
— public data and code for Settles & Meeder (ACL 2016); half-life regression for spaced repetition.
</li>
<li>
<a href="https://sharedtask.duolingo.com/2018.html" target="_blank" rel="noopener noreferrer">Duolingo SLAM shared task dataset (2018)</a>
— 2018 shared task on second language acquisition modeling (SLAM); token-level learner data with English, Spanish, and French tracks (BEA / NAACL-HLT 2018).
</li>
<li>
<a href="https://corpus.mml.cam.ac.uk/efcamdat/" target="_blank" rel="noopener noreferrer">EF-Cambridge Open Language Database (EFCAMDAT)</a>
— large open-access corpus of L2 English learner essays with CEFR-style levels; access requires an application with an academic affiliation, and the data is distributed through Google Drive after approval.
</li>
<li>
<a href="https://crtranscript.tumblr.com/transcripts" target="_blank" rel="noopener noreferrer">Critical Role transcripts</a>
— community episode transcripts (edited for length and clarity; see their site for caveats).
</li>
<li>
<a href="https://callison-burch.github.io/publications/fireball-dataset.pdf" target="_blank" rel="noopener noreferrer">FIREBALL (D&D actual-play)</a>
— ~25k Discord sessions of real Dungeons & Dragons play via the Avrae bot, with aligned dialogue, executable commands, and ground-truth game state (Zhu et al.; PDF describes data and NLG / utterance-to-command tasks).
</li>
<li>
<a href="https://tatoeba.org/en/downloads" target="_blank" rel="noopener noreferrer">Tatoeba multilingual sentence corpus</a>
— large collection of example sentences and translation links (CC BY / CC0 depending on export); weekly dumps and custom sentence-pair exports.
</li>
</ul>
</div>
</main>
<footer class="site-footer">
<div class="container footer-inner">
<p>© <span id="year">2026</span> Brendan Tomoschuk</p>
<p>
<a href="https://github.com/tomoschuk" target="_blank" rel="noopener noreferrer">GitHub</a>
|
<a href="https://www.linkedin.com/in/tomoschuk/" target="_blank" rel="noopener noreferrer">LinkedIn</a>
</p>
</div>
</footer>
</body>
</html>