Skip to content

Orangewarrior/vec-eyes-lib

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

61 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Vec-Eyes Core πŸ§ πŸ”¬

Behavior intelligence engine for Rust

Vec-Eyes Core is a modular behavior classification engine written in Rust, designed to power detection systems across security, data science, and biological domains.

It combines lightweight built-in ML models, NLP feature extraction, external embedding support, and rule-based matching into a unified engine for pattern detection and classification.


πŸš€ What is Vec-Eyes Core?

Vec-Eyes Core is the engine behind Vec-Eyes CLI.

It provides:

  • 🧠 Machine Learning (KNN, Naive Bayes, Logistic Regression, SVM, Random Forest, Gradient Boosting, Isolation Forest)
  • πŸ”‘ NLP pipelines (Tokenization, TF-IDF, lightweight internal embeddings, external embeddings)
  • ⚑ Vector similarity (Word2Vec, FastText)
  • πŸ”Ž Rule engine (Regex / optional VectorScan)
  • πŸ“Š Hybrid scoring system

Production Notes

Vec-Eyes ships pure-Rust, dependency-light implementations that are useful for embedded classifiers, deterministic tests, security-oriented text features, and quick baselines. For high-stakes production ML, treat the built-in Word2Vec/FastText-style training and classical models as lightweight estimators rather than drop-in replacements for mature training stacks such as scikit-learn, XGBoost, or full fastText.

Recommended production path:

  • Use built-in Count/TF-IDF, KNN, Bayes, and rules for small or medium text-classification workloads.
  • Use external fastText or word2vec embeddings when semantic quality matters.
  • Evaluate every pipeline with held-out metrics before deploying; see src/metrics.rs for accuracy, F1, ROC-AUC, confusion matrix, and reports.
  • Load bincode models only from trusted sources. Prefer JSON or a controlled artifact pipeline for external model exchange.

Model Capability Matrix

Model Best Fit Notes
KNN Similarity search on small/medium corpora Brute-force exact search; cosine/euclidean paths cache training norms.
Bayes Fast lexical baselines Count and IDF-weighted modes; simple and explainable.
Logistic Regression Linear text classification One-vs-rest SGD-style trainer.
SVM Linear and approximate kernels RBF uses random Fourier features; polynomial/sigmoid use landmarks.
Random Forest Robust tabular-like dense features Supports balanced bootstrap, ExtraTrees-style splits, OOB score.
Gradient Boosting Small dense feature sets Uses shallow regression trees and honors max_depth.
Isolation Forest Anomaly scoring Best with enough representative cold/normal samples.

πŸ“š Documentation

Guide Description
Real Classification Examples Three end-to-end projects (security, biology, finance) β€” download a UCI dataset, train a classifier, build and run
Save & Load β€” Model Persistence JSON, bincode, and split-bincode formats; external fastText workflow with UCI dataset examples

Quick links by topic


🎯 Use Cases

πŸ” Security & Threat Detection

  • Spam classification
  • Phishing detection
  • Web attack identification (SQLi, XSS, fuzzing)
  • Malware behavior analysis
  • Log anomaly detection

πŸ’° Fraud Detection

  • Transaction anomaly detection
  • Behavioral fraud patterns
  • Suspicious activity classification

🧬 Biological & Scientific Analysis

Vec-Eyes Core can be adapted for:

  • Virus pattern classification
  • Human / biological signal classification
  • Bacteria and fungus pattern detection
  • Biomedical text/log classification

βš™οΈ Core Architecture

Input Text / Data
        ↓
Normalization / Tokenization
        ↓
Feature Extraction (TF-IDF / Embeddings)
        ↓
ML Engine (classifiers)
        ↓
Rule Engine (Regex / VectorScan)
        ↓
Hybrid Scoring
        ↓
Final Classification

🧠 Machine Learning

βœ” Multiple ML Algorithms:

  • KNN (Cosine, Euclidean, Manhattan, Minkowski)
  • Naive Bayes (Count, TF-IDF)
  • Logistic Regression
  • SVM (Linear, RBF, Polynomial, Sigmoid)
  • Random Forest (Standard, Balanced, ExtraTrees + OOB)
  • Gradient Boosting
  • Isolation Forest

πŸ”‘ NLP Pipeline

  • Tokenization
  • Normalization
  • TF-IDF
  • Word2Vec (lightweight training)
  • FastText-style embeddings (subword support)

πŸ”Ž Rule Engine

Vec-Eyes supports rule-based matching with scoring.

βœ” Default (no dependencies)

  • Regex-based matcher

βœ” Optional (feature flag)

  • VectorScan feature-gated backend. The current implementation keeps a safe regex-compatible fallback behind the feature until native scanning paths are wired.

πŸ”₯ YAML Pipeline

Vec-Eyes allows full pipeline definition via nested YAML. The current format groups data, NLP, model configuration, optional rules, and output settings explicitly.

Minimal KNN + FastText

report_name: Spam Security Classifier

data:
  hot_test_path: /data/email/spam/
  cold_test_path: /data/email/normal/
  hot_label: SPAM
  cold_label: RAW_DATA
  recursive_way: On
  score_sum: On

pipeline:
  nlp: FastText
  threads: 4

model:
  method: KnnCosine
  k: 5

extra_match:
  - engine: Regex
    path: /data/rules/spam_rules.txt
    score_add_points: 70
    title: Spam Keywords
    description: Detect common spam patterns

Bayes + TF-IDF

report_name: Fraud Narrative Classifier

data:
  hot_test_path: /data/fraud/hot/
  cold_test_path: /data/fraud/cold/
  hot_label: ANOMALY
  cold_label: RAW_DATA
  recursive_way: On
  score_sum: Off

pipeline:
  nlp: TfIdf
  threads: 2

model:
  method: Bayes

Random Forest + OOB

data:
  hot_test_path: /data/http/attacks/
  cold_test_path: /data/http/normal/
  hot_label: WEB_ATTACK
  cold_label: RAW_DATA

pipeline:
  nlp: TfIdf
  threads: 8

model:
  method: RandomForest
  n_trees: 200
  max_depth: 10
  mode: ExtraTrees
  max_features: Sqrt
  bootstrap: true
  oob_score: true

Gradient Boosting With Shallow Trees

data:
  hot_test_path: /data/fraud/high-risk/
  cold_test_path: /data/fraud/low-risk/
  hot_label: ANOMALY
  cold_label: RAW_DATA

pipeline:
  nlp: TfIdf
  threads: 4

model:
  method: GradientBoosting
  n_estimators: 100
  learning_rate: 0.1
  max_depth: 3

Isolation Forest

data:
  hot_test_path: /data/anomaly/known_outliers/
  cold_test_path: /data/anomaly/normal/
  hot_label: ANOMALY
  cold_label: RAW_DATA

pipeline:
  nlp: FastText
  threads: 4

model:
  method: IsolationForest
  n_trees: 150
  contamination: 0.02
  subsample_size: 256

Rust API Examples

KNN + FastText

use vec_eyes_lib::{ClassificationLabel, ClassifierFactory, ClassifierMethod, NlpOption};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let classifier = ClassifierFactory::builder()
        .method(ClassifierMethod::KnnCosine)
        .nlp(NlpOption::FastText)
        .hot_path("data/spam")
        .cold_path("data/normal")
        .hot_label(ClassificationLabel::Spam)
        .cold_label(ClassificationLabel::RawData)
        .k(5)
        .threads(Some(4))
        .build()?;

    let result = classifier.classify_text(
        "claim your free casino bonus now",
        vec_eyes_lib::ScoreSumMode::Off,
        &[],
    );
    println!("{result:?}");

    Ok(())
}

Bayes + TF-IDF

use vec_eyes_lib::{ClassificationLabel, ClassifierFactory, ClassifierMethod, NlpOption};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let classifier = ClassifierFactory::builder()
        .method(ClassifierMethod::Bayes)
        .nlp(NlpOption::TfIdf)
        .hot_path("data/fraud")
        .cold_path("data/normal")
        .hot_label(ClassificationLabel::Anomaly)
        .cold_label(ClassificationLabel::RawData)
        .threads(Some(2))
        .build()?;

    let result = classifier.classify_text(
        "urgent offshore transfer to anonymous account",
        vec_eyes_lib::ScoreSumMode::Off,
        &[],
    );
    println!("{result:?}");

    Ok(())
}

Compatibility Matrix

The matrix below is designed to make Vec-Eyes easier to understand and safer to configure. It summarizes what each classifier is best at, which NLP representations fit best, which parameters are required, and which non-trivial parameters strongly affect results.

Classifier Best NLP / Feature Input Required Parameters Important Non-Trivial Parameters Best Use Cases Notes
Bayes Count, TfIdf None threads Spam detection, fast baseline text classification, simple fraud text screening Very fast and stable. Best as a baseline. Not ideal for dense embeddings like Word2Vec/FastText.
KnnCosine FastText, Word2Vec k threads Similarity-based classification, noisy text, phishing, behavioral text matching Strong default for embedding-based text classification. Cosine is usually the best first KNN metric for dense vectors.
KnnEuclidean FastText, Word2Vec k threads Distance-based embedding experiments, attack clustering More sensitive to magnitude than cosine. Useful for controlled experiments.
KnnManhattan FastText, Word2Vec k threads Alternative distance profile for embeddings Often used for experimentation rather than as the default production KNN choice.
KnnMinkowski FastText, Word2Vec k, p threads Research-style distance tuning, anomaly/similarity experiments p changes the geometry of distance. Use only when you explicitly want to tune distance behavior.
LogisticRegression TfIdf, Count, sometimes dense embeddings logistic_learning_rate, logistic_epochs logistic_lambda, threads Fraud classification, text classification, strong production baseline Great balance between interpretability, speed, and quality. Very practical model.
SVM TfIdf, Count, sometimes dense embeddings svm_kernel, svm_c svm_learning_rate, svm_epochs, svm_gamma, svm_degree, svm_coef0, threads Security text classification, spam, fraud, web attack text Linear is usually the best first choice. Rbf is more expressive but more sensitive to tuning.
RandomForest TfIdf, FastText, structured-ish feature sets random_forest_n_trees random_forest_mode, random_forest_max_depth, random_forest_max_features, random_forest_min_samples_split, random_forest_min_samples_leaf, random_forest_bootstrap, random_forest_oob_score, threads Richer risk scoring, structured features, fraud, mixed-signal classification Good when you want ensembles and model diversity. Supports Standard, Balanced, and ExtraTrees.
GradientBoosting TfIdf, structured-ish feature sets n_estimators, learning_rate max_depth, threads Fraud/risk scoring, more expressive tabular-like classification Uses shallow regression trees and honors max_depth; more sensitive to hyperparameters than Bayes or Logistic Regression.
IsolationForest FastText, Word2Vec, anomaly-oriented feature sets isolation_forest_n_trees, isolation_forest_contamination isolation_forest_subsample_size, threads Anomaly detection, unusual behavior detection, outlier hunting Best when the goal is finding what looks abnormal rather than choosing among many known labels.

🧠 How Dataset Loading Works

  • hot directories β†’ labeled as target class
  • cold directories β†’ baseline / normal behavior
  • All files are read recursively
  • Multiple directories supported
  • Each file contributes to training vectors

πŸ”§ Optional VectorScan Support

Fedora

sudo dnf install boost-devel cmake gcc gcc-c++

Debian / Ubuntu

sudo apt install libboost-all-dev cmake build-essential
cargo build --features vectorscan

βš™οΈ Supported Methods

  • Bayes
  • KnnCosine
  • KnnEuclidean
  • KnnManhattan
  • KnnMinkowski (requires p)
  • LogisticRegression
  • RandomForest
  • Svm
  • GradientBoosting
  • IsolationForest

⚠️ Validation Rules

  • KNN requires k
  • Minkowski requires p
  • Logistic Regression requires positive learning_rate and epochs
  • Random Forest requires n_trees >= 1
  • SVM requires kernel and positive c
  • Gradient Boosting requires n_estimators >= 1 and positive learning_rate
  • Isolation Forest requires n_trees >= 1 and contamination in (0, 0.5)
  • YAML is validated before execution

⚑ Performance

  • Rust-native
  • Rayon parallelism
  • ndarray + BLAS ready

🧩 Embedding in Your Project

Use ClassifierFactory::builder() for dynamic construction, or ClassifierSpec / TypedClassifierBuilder when you want stronger compile-time guidance.


πŸ”— Relationship with CLI

  • vec-eyes-lib β†’ core engine
  • vec-eyes-cli β†’ interface layer

πŸ§ͺ Testing Guide

https://github.com/Orangewarrior/vec-eyes-lib/blob/main/tests/tests.md https://github.com/Orangewarrior/vec-eyes-lib/wiki/%F0%9F%A7%AA-Testing-Guide

🀝 Contributing

We welcome contributions in:

  • ML improvements
  • Performance optimization
  • Rule engine enhancements
  • Dataset integrations
  • Biological classification extensions

πŸ‘€ Author

Orangewarrior

If you like Vec-Eyes:

  • ⭐ Star the repo
  • πŸ’‘ Open issues
  • πŸ”§ Contribute

About

Vec-Eyes is a high-performance library for behavior classification engine in Rust, combining NLP, ML, vector embeddings, and rule-based matching.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages