This work implements an end-to-end automated trading prototype that integrates natural language sentiment extraction with time-series price modeling for crypto/stock assets. The pipeline begins with large-scale tweet collection (~1.5M items) using a scraping framework (twint), followed by preprocessing, tokenization, and sentiment classification with a fine-tuned BERT transformer. The sentiment features are temporally aligned with OHLCV and technical indicators to produce a fused feature space for downstream modeling. The system architecture includes modular scripts for data ingestion, feature engineering, model training, and a Flask-based bot capable of live inference and simulated trade execution.
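The temporal alignment step described above can be sketched as follows. This is an illustrative assumption of how per-tweet sentiment scores might be bucketed and joined onto OHLCV bars; the function name, column names ("timestamp", "sentiment", "close"), and the hourly bar frequency are hypothetical, not the repository's actual schema.

```python
import pandas as pd


def fuse_sentiment_with_ohlcv(tweets: pd.DataFrame, ohlcv: pd.DataFrame) -> pd.DataFrame:
    """Aggregate tweet-level sentiment into hourly features and join them onto OHLCV bars."""
    # Index tweets by time, then bucket sentiment into hourly mean and tweet count.
    tweets = tweets.set_index(pd.to_datetime(tweets["timestamp"]))
    hourly = tweets["sentiment"].resample("1h").agg(["mean", "count"])
    hourly.columns = ["sent_mean", "tweet_volume"]

    # Left-join so every price bar survives, even hours with no tweets.
    ohlcv = ohlcv.set_index(pd.to_datetime(ohlcv["timestamp"]))
    fused = ohlcv.join(hourly, how="left")
    fused["sent_mean"] = fused["sent_mean"].fillna(0.0)          # no tweets -> neutral
    fused["tweet_volume"] = fused["tweet_volume"].fillna(0).astype(int)
    return fused
```

Keeping the tweet count alongside the mean score lets the downstream model distinguish a quiet neutral hour from a loud but balanced one.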
The modeling stack explores multiple paradigms: (1) classical ML baselines (RandomForestRegressor, logistic regression, SVM) to benchmark signal strength; (2) a deep sequence model (LSTM/GRU) trained on concatenated sentiment-price sequences to capture temporal dependencies; and (3) hybrid pipelines in which transformer-derived sentiment embeddings are aggregated into fixed-length windows and merged with technical features before being passed to the LSTM. The project also experiments with feed-forward neural networks and gradient boosting to compare sample efficiency. Models are evaluated on predictive accuracy (classification F1 / regression RMSE) as well as on strategy-level backtest metrics (cumulative returns, Sharpe ratio, maximum drawdown).
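The strategy-level backtest metrics listed above can be computed from a series of per-period strategy returns; a minimal sketch, not the repository's evaluation code (the function name and annualization convention are assumptions):

```python
import numpy as np


def backtest_metrics(returns: np.ndarray, periods_per_year: int = 252) -> dict:
    """Cumulative return, annualized Sharpe ratio, and max drawdown from per-period returns."""
    equity = np.cumprod(1.0 + returns)                  # growth of $1 over the backtest
    cumulative = equity[-1] - 1.0
    sharpe = (returns.mean() / returns.std(ddof=1)) * np.sqrt(periods_per_year)
    running_peak = np.maximum.accumulate(equity)        # best equity seen so far
    max_drawdown = np.min(equity / running_peak - 1.0)  # most negative dip from a peak
    return {"cum_return": cumulative, "sharpe": sharpe, "max_drawdown": max_drawdown}
```

Reporting drawdown alongside Sharpe matters here because sentiment-driven strategies can look strong on average yet suffer sharp losses during sentiment regime shifts.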
Empirical evaluation shows that incorporating sentiment features lowers error (RMSE) and improves directional accuracy relative to price-only models. The LSTM with BERT-based sentiment inputs consistently outperformed the baselines, highlighting the predictive value of social signals during high-volatility intervals. The repository provides not only model notebooks but also a structured codebase (structured_code/) for reproducibility, checkpointed models (Models-Trade Bot/), and a Flask API for integrating predictions into a live trading loop. Limitations include the noise sensitivity of sentiment features and a tendency to overfit on small training windows. Future extensions could involve domain-specific transformers (FinBERT), probabilistic forecasting (Bayesian LSTMs, quantile regression), and reinforcement learning agents for direct policy optimization.
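The simulated trade execution that closes the loop can be sketched as a simple next-bar paper-trading routine. The signal convention ({-1, 0, +1} per bar), the function name, and the flat per-rebalance fee are illustrative assumptions, not the bot's actual execution logic:

```python
def simulate_trades(closes: list[float], signals: list[int], fee: float = 0.001) -> float:
    """Paper-trade $1 on per-bar signals in {-1, 0, +1}; return final equity.

    The position held during bar i is the signal chosen at bar i-1, so the
    model's prediction is only ever applied to the *next* bar's return.
    """
    equity, position = 1.0, 0
    for i in range(1, len(closes)):
        bar_return = closes[i] / closes[i - 1] - 1.0
        equity *= 1.0 + position * bar_return   # P&L from the position held into this bar
        if signals[i] != position:              # rebalance to the new signal
            equity *= 1.0 - fee                 # charge a flat fee per position change
            position = signals[i]
    return equity
```

Applying each signal only from the following bar onward avoids the look-ahead bias that inflates backtest returns when predictions are traded on the same bar they were generated from.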