A Chinese Word-level Language Model Implemented from Scratch (GRU-based)
CrystalLM is an educational and experimental Chinese language modeling project. Starting from the most fundamental text processing, it incrementally implements:
- Chinese word segmentation (jieba)
- Vocabulary construction, pruning, and governance (<UNK> mechanism)
- Word-level recurrent language model (GRU)
- Text generation with temperature / sampling control
- Model checkpoint saving and loading (directly reusable for inference)
This project intentionally avoids high-level wrapper frameworks. All core logic is written from scratch, making it ideal for understanding the entire lifecycle of a language model—from raw text to generated output.
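As a first taste of the pipeline, the segmentation stage is jieba applied to raw text; everything downstream (vocabulary, model inputs, generation) is built on these word tokens. A minimal illustration (the sentence is arbitrary):

```python
import jieba

# Stage 1: word-level segmentation of raw Chinese text.
# jieba.lcut returns a plain Python list of word tokens.
tokens = jieba.lcut("我喜欢自然语言处理")
print(tokens)  # the segmented word list that the vocabulary is built from
```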
- 📖 Word-level language modeling (Chinese)
- 🧠 GRU-based sequence modeling (2–3 layers supported)
- ✂️ Vocabulary pruning + <UNK> fallback
- 🔥 Temperature / top-p style generation control
- 💾 Model checkpoint save & load
- 🧪 Designed for learning, inspection, and modification
CrystalLM/
├── train.py # Training script
├── generate.py # Load model & generate text
├── model.py # Word-level GRU model definition
├── train.txt # Training corpus
├── crystallm_gru.pt # Trained model (optional)
└── README.md
Using a single-file script is perfectly acceptable at early stages. Modularization can be done later when the design stabilizes.
- Python 3.10+
- PyTorch
- jieba
Install dependencies:
pip install torch jieba
Prepare a Chinese corpus in train.txt, then run:
python train.py
After training, a model checkpoint will be generated, for example:
crystallm_gru.pt
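A checkpoint of this kind typically bundles the trained weights with the vocabulary mapping, so that inference can rebuild exactly the same token ids. A minimal sketch of the save/load round trip; the key names and stand-in objects here are assumptions, not CrystalLM's actual format:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins; in CrystalLM these would be the trained word-level GRU and its vocabulary.
model = nn.GRU(input_size=128, hidden_size=256, num_layers=2, batch_first=True)
word2idx = {"<UNK>": 0, "你": 1}

# Save the weights together with the vocabulary mapping.
torch.save({"model_state": model.state_dict(), "word2idx": word2idx}, "crystallm_gru.pt")

# Later, in generate.py-style code: rebuild the module, then restore both pieces.
checkpoint = torch.load("crystallm_gru.pt", map_location="cpu")
model.load_state_dict(checkpoint["model_state"])
word2idx = checkpoint["word2idx"]
model.eval()
```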
python generate.py
Or invoke generation directly in code:
print(generate("你", temperature=0.8))- Model Type: Word-level GRU Language Model
- Embedding Dim: ~128
- Hidden Dim: ~256
- GRU Layers: 2–3
- Context Window: ~50 tokens (configurable)
- Loss Function: CrossEntropyLoss
- Optimizer: Adam
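To make those numbers concrete, here is a minimal sketch of what a word-level GRU language model with this configuration can look like in PyTorch. The class and argument names are illustrative, not necessarily those used in model.py:

```python
import torch
import torch.nn as nn

class WordGRULM(nn.Module):
    """Word-level GRU language model (illustrative sketch)."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)  # project hidden states back to vocabulary logits

    def forward(self, x, hidden=None):
        emb = self.embed(x)                  # (batch, seq_len) -> (batch, seq_len, embed_dim)
        out, hidden = self.gru(emb, hidden)  # (batch, seq_len, hidden_dim)
        logits = self.fc(out)                # (batch, seq_len, vocab_size)
        return logits, hidden
```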
During training, loss converges steadily. Generated text gradually reflects recognizable Chinese structure and rhythm.
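The training loop itself is the standard next-token setup: shift each sequence by one position, score every step with CrossEntropyLoss, and update with Adam. A compact sketch, assuming the WordGRULM class above and a batch of token ids (in train.py the batch would come from the segmented corpus, not from random data):

```python
import torch
import torch.nn as nn

vocab_size, seq_len = 5000, 50
model = WordGRULM(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Dummy batch of token ids, shape (batch, seq_len + 1); inputs and targets are shifted by one token.
batch = torch.randint(0, vocab_size, (32, seq_len + 1))
inputs, targets = batch[:, :-1], batch[:, 1:]

logits, _ = model(inputs)  # (32, seq_len, vocab_size)
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```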
To control vocabulary size and training stability, low-frequency tokens are mapped to <UNK>.
This is a standard and necessary practice in classical language modeling.
To reduce <UNK> frequency:
- Increase corpus size
- Adjust min_freq
- Actively clean low-quality or noisy tokens
CrystalLM treats vocabulary management as a first-class modeling concern, not an afterthought.
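Concretely, the min_freq cutoff amounts to counting token frequencies over the segmented corpus and sending everything below the threshold to <UNK>. A sketch of that logic with hypothetical variable names (train.txt is the corpus file from the quick start):

```python
from collections import Counter

import jieba

min_freq = 2  # tokens seen fewer times than this fall back to <UNK>

with open("train.txt", encoding="utf-8") as f:
    tokens = [tok for line in f for tok in jieba.lcut(line.strip())]

counts = Counter(tokens)
vocab = ["<UNK>"] + sorted(t for t, c in counts.items() if c >= min_freq)
word2idx = {w: i for i, w in enumerate(vocab)}

# Any pruned or unseen token maps to the <UNK> id, both in training and at inference time.
ids = [word2idx.get(tok, word2idx["<UNK>"]) for tok in tokens]
```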
- temperature < 1.0 → Conservative, stable output
- temperature = 1.0 → Neutral sampling
- temperature > 1.0 → More diverse, higher entropy
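Under the hood, temperature simply rescales the logits before sampling: dividing by a value below 1.0 sharpens the distribution, dividing by a value above 1.0 flattens it. A minimal sketch of that step (a hypothetical helper, not necessarily CrystalLM's exact sampling code):

```python
import torch

def sample_next_token(logits, temperature=1.0):
    # logits: (vocab_size,) raw scores for the next token
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```

A generate()-style loop would apply this at every step, feeding the sampled token back in as the next input.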
Example:
generate("你", temperature=0.7)
generate("你", temperature=1.2)CrystalLM does not aim for SOTA performance.
Its purpose is explicit and pragmatic:
To walk through how a language model actually works, using readable code and explicit design trade-offs.
Best suited for:
- NLP / deep learning beginners
- Learners transitioning from RNNs to Transformers
- Anyone who wants to truly understand vocabulary, context, and generation
CrystalLM is not a toy project, nor a demo assembled from black-box APIs.
It is a deliberately constrained, explicitly designed educational system.
In an era dominated by large-scale Transformers and opaque training pipelines, CrystalLM takes a different stance:
Understanding precedes scaling. Clarity precedes performance.
- End-to-end transparency: every stage, from tokenization to vocabulary pruning, from sequence modeling to sampling, is explicit and inspectable.
- Intentional architectural choices: GRU is used not because it is "simpler", but because it exposes temporal dependency modeling, hidden state dynamics, and the real cost of context length, without abstraction leakage.
- Vocabulary governance as a first-class problem: <UNK> handling, low-frequency pruning, and corpus quality are treated as modeling decisions, not preprocessing trivia.
- No reliance on pretrained magic: all learning emerges from the provided data and code paths, making failure modes observable and instructive.
CrystalLM is designed for learners who want to reason about language models, not just run them.
It prioritizes:
- conceptual correctness over benchmark chasing
- reproducibility over scale
- understanding over illusion of intelligence
If modern LLMs are skyscrapers, CrystalLM is the structural engineering course that explains why they stand.
- GRU-based implementation
- GRU vs LSTM comparative experiments
- Transformer / Mini-GPT variant
- CLI-based interactive generation
- Vocabulary statistics & evaluation logging
MIT License
- PyTorch
- jieba Chinese word segmentation
- Everyone curious enough to build models from first principles