FastAPI service for document chunking and sentence-transformer embeddings. It is intended for RAG pipelines, semantic search, vector database ingestion, and document indexing workflows.
The API accepts files, extracts text or tabular content, creates tokenizer-aware chunks, and optionally returns normalized embedding vectors.
| Type | Notes |
|---|---|
| PDF | Text extraction with pdfplumber. |
| DOCX | Reads paragraphs and table cells. |
| TXT | Tries UTF-8, UTF-8 with BOM, Windows-1254, ISO-8859-9, and Latin-1 (ISO-8859-1) in order. |
| CSV | Detects common delimiters and handles UTF-16 BOM / NUL bytes. |
| XLSX / XLS | Reads spreadsheets into DataFrames. |
| ZIP | Extracts supported files, including password-protected AES ZIP archives. |
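The TXT fallback order can be sketched as follows. This is a simplified illustration, not the service's actual implementation; `decode_text` is a hypothetical helper, and the encoding names are Python codec aliases for the encodings listed above:

```python
def decode_text(raw: bytes) -> str:
    """Try a sequence of encodings, mirroring the documented TXT fallback order."""
    for enc in ("utf-8-sig", "utf-8", "cp1254", "iso8859-9", "latin-1"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # latin-1 accepts any byte sequence, so this line is unreachable in practice.
    return raw.decode("utf-8", errors="replace")
```

Because Latin-1 maps every byte to a character, the loop always succeeds by the last entry, so the function never raises on arbitrary input.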
The current implementation requires CUDA when /process loads a model or tokenizer. Dockerfile.cpu and docker-compose.cpu.yml exist, but app.config.get_device() raises an error if CUDA is unavailable. In practice, use the GPU Docker setup for processing requests.
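The documented contract of `app.config.get_device()` can be illustrated with a torch-free sketch; `cuda_available` here is a hypothetical boolean stand-in for `torch.cuda.is_available()`, and the message text is illustrative:

```python
def get_device(cuda_available: bool) -> str:
    """Sketch of the documented behavior: hard-fail when no CUDA device exists."""
    if not cuda_available:
        raise RuntimeError("CUDA is required; no GPU device is available.")
    return "cuda"
```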
- Sentence-aware chunking with tokenizer-based size checks.
- Turkish and English abbreviation handling.
- Decimal number protection, such as `3.14`.
- Configurable chunk overlap.
- Optional text normalization: `none`, `lowercase`, `uppercase`.
- Row-level CSV and Excel chunking.
- Optional row metadata attachment for tabular files.
- BGE and E5 query/passage prefix support.
- Model caching through `ModelManager`.
- JSON, base64 NPY, and ZIP response formats.
- Bearer token authentication through `CHUNKING_AUTH_TOKEN`.
- File size, timeout, and concurrency limits.
- Structured JSON logging and consistent error responses.
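The sentence-aware chunking with overlap described above can be sketched roughly like this. It is an illustrative approximation that counts whitespace-separated words, whereas the service sizes chunks with the embedding model's real tokenizer:

```python
import re


def chunk_sentences(text: str, max_tokens: int = 512, overlap: int = 100) -> list:
    """Greedy sentence packing with approximate token overlap between chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # word count stands in for tokenizer token count
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward until roughly `overlap` tokens.
            kept, kept_len = [], 0
            for prev in reversed(current):
                kept_len += len(prev.split())
                kept.insert(0, prev)
                if kept_len >= overlap:
                    break
            current, current_len = kept, kept_len
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With `max_tokens=6` and `overlap=3`, four three-word sentences pack into three chunks, each sharing one sentence with its predecessor.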
Start the GPU service:
```bash
cp .env.example .env
docker compose up -d --build
```

With the provided `.env`, the service is available at http://localhost:8001.
Health check:

```bash
curl http://localhost:8001/health
```

List models:

```bash
curl http://localhost:8001/models
```

Process a document:
```bash
curl -X POST http://localhost:8001/process \
  -H "Authorization: Bearer changeme-chunking-token" \
  -F "files=@document.pdf" \
  -F "model_name=BAAI/bge-m3" \
  -F "max_tokens=512" \
  -F "overlap=100"
```

Return chunks without embedding arrays:
```bash
curl -X POST http://localhost:8001/process \
  -H "Authorization: Bearer changeme-chunking-token" \
  -F "files=@document.pdf" \
  -F "skip_embedding=true"
```

The /process endpoint processes uploaded files and returns chunks with optional embeddings.
Authentication is required only when CHUNKING_AUTH_TOKEN is set.
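Assuming the documented rule (auth disabled when the token is empty), a minimal bearer check could look like the sketch below; `is_authorized` is a hypothetical helper, not the service's actual middleware:

```python
def is_authorized(auth_header, configured_token: str) -> bool:
    """Mirror the documented rule: an empty CHUNKING_AUTH_TOKEN disables auth."""
    if not configured_token:
        return True  # auth disabled
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    return auth_header.removeprefix("Bearer ") == configured_token
```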
Common form parameters:
| Parameter | Default | Description |
|---|---|---|
| `files` | Required | One or more uploaded files. |
| `model_name` | First configured model | Model key from `app/config.py`. |
| `normalization` | `none` | `none`, `lowercase`, or `uppercase`. |
| `max_tokens` | `512` | Maximum tokens per chunk. |
| `overlap` | `100` | Approximate token overlap between chunks. |
| `skip_embedding` | `false` | Return chunks only. |
| `response_format` | `json` | `json`, `json_with_embeddings`, or `zip`. |
| `embedding_batch_size` | `32` | Batch size for model encoding. |
| `prefix_mode` | `passage` | `passage` or `query` for BGE/E5-style models. |
| `zip_password` | `null` | Password for encrypted ZIP inputs. |
| `output_text_column` | Empty string | Use one table column as chunk text when present. |
| `include_column_names` | `true` | Include column names in table chunk text. |
| `attach_row_data` | `true` | Add source row data to metadata. |
Response shape:

```json
{
  "results": [
    {
      "filename": "document.pdf",
      "chunks": [
        {
          "text": "First sentence. Second sentence.",
          "metadata": {
            "source_file": "document.pdf",
            "chunk_index": 0,
            "token_count": 128,
            "sentence_aware": true,
            "embedding_model_id": "BAAI/bge-m3"
          }
        }
      ],
      "embeddings": [[0.012, -0.034]]
    }
  ],
  "total_chunks": 1,
  "model_name": "BAAI/bge-m3",
  "processing_time_seconds": 1.42
}
```

The /models endpoint returns all configured embedding models and the default model.
The /health endpoint returns service status, active request count, waiting request count, and loaded model names.
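A /process response with the shape documented above can be flattened into (text, vector) pairs on the client side; `flatten_chunks` is a hypothetical helper, not part of the service:

```python
def flatten_chunks(response: dict) -> list:
    """Pair each chunk's text with its embedding vector (None when skipped)."""
    pairs = []
    for result in response["results"]:
        embeddings = result.get("embeddings") or [None] * len(result["chunks"])
        for chunk, vector in zip(result["chunks"], embeddings):
            pairs.append((chunk["text"], vector))
    return pairs
```

When `skip_embedding=true` was requested, `embeddings` is absent and every pair carries `None` in place of a vector.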
| Model key | Dimension | Max length | Language |
|---|---|---|---|
| `all-MiniLM-L6-v2` | 384 | 256 | English |
| `paraphrase-multilingual-MiniLM-L12-v2` | 384 | 128 | Multilingual |
| `all-mpnet-base-v2` | 768 | 384 | English |
| `bge-base-en-v1.5` | 768 | 512 | English |
| `bge-large-en-v1.5` | 1024 | 512 | English |
| `BAAI/bge-m3` | 1024 | 8192 | Multilingual |
| `multilingual-e5-large` | 1024 | 512 | Multilingual |
| `e5-base-v2` | 768 | 512 | English |
| `e5-large-v2` | 1024 | 512 | English |
For Turkish or multilingual workloads, BAAI/bge-m3 and multilingual-e5-large are the most relevant starting points.
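The `prefix_mode` parameter maps to the query/passage conventions published with the E5 model family, which expect inputs prefixed with `"query: "` or `"passage: "`. The sketch below is illustrative only; the service's exact prefix strings per model live in its own configuration:

```python
def apply_prefix(text: str, prefix_mode: str = "passage") -> str:
    """Prepend an E5-style role prefix to the input text."""
    if prefix_mode not in ("query", "passage"):
        raise ValueError("prefix_mode must be 'query' or 'passage'")
    return f"{prefix_mode}: {text}"
```

Documents indexed with `passage` and searches issued with `query` keep the two sides of retrieval consistent with how these models were trained.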
The service reads settings from environment variables. Docker Compose loads .env.
| Variable | Default | Description |
|---|---|---|
| `API_PORT` | `8000` | Container application port. |
| `WORKERS` | `1` | Uvicorn worker count in Docker. |
| `LOG_LEVEL` | `INFO` | Logging level. |
| `MAX_TOKENS` | `512` | Default chunk size. |
| `OVERLAP` | `100` | Default chunk overlap. |
| `MAX_FILE_SIZE_MB` | `100` | Maximum uploaded file size. |
| `MAX_FILES` | `10` | Maximum files per request. |
| `MAX_TABLE_ROWS` | `500000` | Maximum CSV/Excel rows. |
| `MAX_CONCURRENT_REQUESTS` | `2` | Concurrent embedding limit. |
| `REQUEST_TIMEOUT_SECONDS` | `1800` | Request timeout. |
| `CONCURRENCY_ACQUIRE_TIMEOUT` | `30` | Wait time for an embedding slot. |
| `PRELOAD_MODELS` | Empty | Comma-separated model keys loaded at startup. |
| `CHUNKING_AUTH_TOKEN` | Empty | Bearer token. Empty means auth is disabled. |
| `CUDA_VISIBLE_DEVICES` | Runtime-dependent | Visible CUDA device selection. |
Example:
```env
API_PORT=8001
WORKERS=1
MAX_TOKENS=512
OVERLAP=100
MAX_FILE_SIZE_MB=100
MAX_FILES=10
MAX_CONCURRENT_REQUESTS=2
PRELOAD_MODELS=BAAI/bge-m3
CUDA_VISIBLE_DEVICES=0
CHUNKING_AUTH_TOKEN=change-this-token
```

Do not commit real Hugging Face tokens or production API tokens.
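A settings reader along these lines would consume the variables above; this is a sketch, not the service's actual `app/config.py`, and `int_setting` is a hypothetical helper:

```python
import os


def int_setting(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    raw = os.environ.get(name, "").strip()
    return int(raw) if raw else default


max_tokens = int_setting("MAX_TOKENS", 512)
overlap = int_setting("OVERLAP", 100)
```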
```
Upload files
  -> FileProcessor
  -> TextNormalizer / TableProcessor
  -> Chunker
  -> ModelManager
  -> EmbeddingEngine
  -> ResponseFormatter
```
Main modules:
| Module | Purpose |
|---|---|
| `app/api/router.py` | Defines /process, /models, and /health. |
| `app/services/file_processor.py` | Reads supported document formats. |
| `app/services/chunker.py` | Creates sentence-aware chunks. |
| `app/services/table_processor.py` | Converts tables into row-level chunks. |
| `app/services/model_manager.py` | Loads and caches models/tokenizers. |
| `app/services/embedding.py` | Generates embeddings. |
| `app/services/response_formatter.py` | Builds JSON, base64 NPY, and ZIP responses. |
| `app/middleware/*` | Handles file size checks, timeouts, and errors. |
Response formats:

- `json`: returns embeddings as nested numeric arrays.
- `json_with_embeddings`: returns embeddings as base64-encoded NPY data.
- `zip`: returns `results.json` plus one `.npy` file per processed input.
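For `json_with_embeddings`, the base64 payload can be decoded back into a NumPy array, assuming it wraps standard `.npy` bytes; the exact response field name is defined by the service, and the round-trip below only demonstrates the encoding scheme:

```python
import base64
import io

import numpy as np


def decode_npy_field(b64_payload: str) -> np.ndarray:
    """Decode a base64-encoded NPY blob back into a NumPy array."""
    return np.load(io.BytesIO(base64.b64decode(b64_payload)))


# Round-trip: roughly how a base64 NPY payload could be produced and consumed.
buf = io.BytesIO()
np.save(buf, np.array([[0.012, -0.034]], dtype=np.float32))
payload = base64.b64encode(buf.getvalue()).decode("ascii")
vectors = decode_npy_field(payload)
```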
Common error codes:
| Status | Code | Meaning |
|---|---|---|
| 400 | `INVALID_PARAMETER` | Request validation failed. |
| 400 | `INVALID_FILE` | File could not be parsed. |
| 400 | `ZIP_PASSWORD_REQUIRED` | ZIP requires a password. |
| 400 | `ZIP_WRONG_PASSWORD` | ZIP password is incorrect. |
| 408 | `REQUEST_TIMEOUT` | Request exceeded timeout. |
| 413 | `FILE_TOO_LARGE` | File or request is too large. |
| 422 | `UNSUPPORTED_FORMAT` | File extension is not supported. |
| 422 | `TOO_MANY_ROWS` | Table row limit exceeded. |
| 429 | `TOO_MANY_REQUESTS` | Concurrency limit is saturated. |
| 500 | `INTERNAL_ERROR` | Unexpected server error. |
```bash
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
uvicorn app.main:app --host 0.0.0.0 --port 8001 --reload
```

For full /process functionality, use a CUDA-enabled environment and install a compatible PyTorch build.