Oisin003/Financial-NLP-System

Financial-NLP-System

L00172671 - Oisin Gibson

Getting Started

Prerequisites

  • Node.js v18 or higher
  • Python 3.8 or higher (must be on your system PATH)
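Before running setup, it can help to confirm that both prerequisites are met. The following is a minimal sketch, not part of the repository: the function names (`parse_node_version`, `node_ok`, `python_ok`) are hypothetical, and it assumes `node --version` prints a string such as `v18.17.0`.

```python
import re
import shutil
import subprocess
import sys

MIN_NODE = (18, 0, 0)  # README prerequisite: Node.js v18 or higher
MIN_PYTHON = (3, 8)    # README prerequisite: Python 3.8 or higher


def parse_node_version(text):
    """Parse output like 'v18.17.0' into (18, 17, 0); return None if unrecognised."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def node_ok():
    """Return True if `node` is on PATH and reports at least MIN_NODE."""
    node = shutil.which("node")
    if node is None:
        return False
    result = subprocess.run([node, "--version"], capture_output=True, text=True)
    if result.returncode != 0:
        return False
    version = parse_node_version(result.stdout)
    return version is not None and version >= MIN_NODE


def python_ok(version_info=sys.version_info):
    """Return True if the given (major, minor, ...) version is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    print("Node.js >= 18:", "ok" if node_ok() else "missing or too old")
    print("Python >= 3.8:", "ok" if python_ok() else "too old")
```

Run it with the same Python interpreter that is on your PATH, since that is the one the setup script is expected to find.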

First-time setup

Clone the repo, then run:

npm run setup

This downloads and configures everything automatically:

  • Apache Tika (PDF text extraction)
  • Java Runtime (required by Tika)
  • Tesseract OCR (optional; used for scanned PDFs, and the app works without it)
  • Python virtual environment + all NLP packages
  • Node.js dependencies for all packages

Running the app

npm start

This starts all four services together:

| Service | URL |
| --- | --- |
| React client | http://localhost:3000 |
| API server | http://localhost:8080 |
| NLP microservice | http://localhost:8000 |
| Tika (PDF extraction) | http://localhost:9998 |
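To confirm that all four services came up, one option is to test whether each port accepts a TCP connection. This is a rough sketch rather than a script shipped with the project: the helper names (`is_listening`, `check_services`) are hypothetical, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket

# Ports taken from the service list above
SERVICES = {
    "React client": 3000,
    "API server": 8080,
    "NLP microservice": 8000,
    "Tika": 9998,
}


def is_listening(port, host="localhost", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services=SERVICES):
    """Map each service name to True (port open) or False (port closed)."""
    return {name: is_listening(port) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name:<18} {'up' if up else 'down'}")
```

Services may take a few seconds to start, so a "down" result immediately after `npm start` is not necessarily a failure.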

Default accounts

These are created automatically on first run:

| Email | Password | Role |
| --- | --- | --- |
| admin@achilles.com | Admin@123 | Admin |
| demo@achilles.com | Demo@123 | User |

Project Overview

Financial document management and NLP analysis system with:

  • React client for upload, document browsing, and NLP UI
  • Node/Express API for auth, document storage, and processing orchestration
  • Python NLP microservice for extraction and analysis
  • Local runtime dependencies for Java/Tika/Tesseract OCR

Tidy File Map (Readable)

This section focuses on maintained source/config files. Large generated/runtime/vendor folders are summarized at the end for readability.

Directories covered:

  • Root
  • scripts
  • nlp_service
  • server
  • server/models
  • server/middleware
  • server/routes
  • server/services
  • server/tests
  • client
  • client/public
  • client/src
  • client/src/components (core)
  • client/src/components/adminPanel
  • client/src/components/documents
  • client/src/components/documents/documentCard
  • client/src/components/login
  • client/src/components/nlp
  • client/src/hooks
  • client/src/utils

Large Runtime/Generated Areas (Summarized)

These exist in the repo but are intentionally not expanded line-by-line here to keep this README readable:

Reference Material

  • Tokenization Concepts
    • Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
  • Stopword Removal
    • Common English stopwords list based on NLTK (Natural Language Toolkit)
    • Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O'Reilly Media.
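As a rough illustration of the two concepts referenced above, the sketch below tokenizes a sentence and filters out stopwords. It is not the code used in nlp_service: the regular expression is one simple choice among many, the function names are hypothetical, and `STOPWORDS` is a tiny hand-picked subset (the NLTK list is considerably longer).

```python
import re

# Illustrative subset only; an assumption, not the list used by the project
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "is", "was", "by", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens.

    An apostrophe followed by letters is kept inside a token, so
    "company's" stays one token.
    """
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set, keeping order."""
    return [token for token in tokens if token not in stopwords]


if __name__ == "__main__":
    tokens = tokenize("The company's revenue rose by 12% in Q3.")
    print(tokens)
    print(remove_stopwords(tokens))
```

For financial text, it may be worth reviewing the stopword list before use, since some words that are noise in general English carry meaning in reports.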

PDF Processing

Tika, OCR, and Tesseract

JPEG2000 OCR Support (Scanned PDFs)

Some scanned PDFs use JPEG2000 (JP2) images. To OCR these, add the JAI Image I/O JARs:

  1. Download:
    • jai-imageio-core-*.jar
    • jai-imageio-jpeg2000-*.jar
  2. Place both files in server/lib
  3. Restart Tika (npm run tika)
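A quick check that both JARs are in place before restarting Tika can save a debugging round trip. The sketch below is an assumption-laden helper, not part of the repository: `missing_jars` is a hypothetical name, and it only matches the file-name patterns listed in step 1 inside server/lib.

```python
from pathlib import Path

# File-name patterns from step 1 above
REQUIRED_JAR_PATTERNS = ("jai-imageio-core-*.jar", "jai-imageio-jpeg2000-*.jar")


def missing_jars(lib_dir="server/lib", patterns=REQUIRED_JAR_PATTERNS):
    """Return the patterns that have no matching file in lib_dir."""
    lib = Path(lib_dir)
    if not lib.is_dir():
        # Nothing can match if the directory itself is absent
        return list(patterns)
    return [pattern for pattern in patterns if not any(lib.glob(pattern))]


if __name__ == "__main__":
    missing = missing_jars()
    if missing:
        print("Missing from server/lib:", ", ".join(missing))
    else:
        print("Both JAI Image I/O JARs found; restart Tika with: npm run tika")
```

Run it from the repository root so that the relative path server/lib resolves correctly.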

Web Development Frameworks

UI/UX Design

Authentication & Security

File Upload Handling

Data Retention & Scheduling

File Upload Security (Validation & Sanitization)

About

Final Year Project
