Financial-NLP-System

L00172671 - Oisin Gibson

Getting Started

Prerequisites

Node.js v18 or higher
Python 3.8 or higher (must be on your system PATH)
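
To confirm the prerequisites before running setup, a small check along the following lines may help. It is only a sketch and is not the project's scripts/checkSetup.js; the command names `node` and `python3` are assumptions (on Windows the Python command is often `python`).

```python
import re
import shutil
import subprocess

# Minimum versions taken from the Prerequisites list above.
MIN_NODE = (18, 0)
MIN_PYTHON = (3, 8)


def parse_version(text):
    """Extract (major, minor) from strings like 'v18.17.1' or 'Python 3.11.4'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if not match:
        raise ValueError(f"no version number found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version_text, minimum):
    """True if the version in version_text is at least `minimum`."""
    return parse_version(version_text) >= minimum


def installed_version(command):
    """Return the `--version` output of a command on PATH, or None if it is missing."""
    if shutil.which(command) is None:
        return None
    result = subprocess.run([command, "--version"], capture_output=True, text=True)
    return (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    # Command names are assumptions; adjust them for your platform.
    for label, command, minimum in (("Node.js", "node", MIN_NODE),
                                    ("Python", "python3", MIN_PYTHON)):
        found = installed_version(command)
        if found is None:
            print(f"{label}: '{command}' not found on PATH")
            continue
        try:
            status = "OK" if meets_minimum(found, minimum) else "too old"
        except ValueError:
            status = "version not recognised"
        print(f"{label}: {found} ({status})")
```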

First-time setup

Clone the repo, then run:

npm run setup

This downloads and configures everything automatically:

Apache Tika (PDF text extraction)
Java Runtime (required by Tika)
Tesseract OCR (optional; used for scanned PDFs, and the app works without it)
Python virtual environment + all NLP packages
Node.js dependencies for all packages

Running the app

npm start

This starts all four services together:

Service	URL
React client	http://localhost:3000
API server	http://localhost:8080
NLP microservice	http://localhost:8000
Tika (PDF extraction)	http://localhost:9998
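
If one of the services does not come up, a quick port probe can show which one is missing. The sketch below only checks that something is listening on each port from the table above; it does not verify that the listener is the expected service. The helper names are hypothetical.

```python
import socket
from urllib.parse import urlparse

# URLs copied from the service table above.
SERVICES = {
    "React client": "http://localhost:3000",
    "API server": "http://localhost:8080",
    "NLP microservice": "http://localhost:8000",
    "Tika (PDF extraction)": "http://localhost:9998",
}


def port_is_open(url, timeout=1.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    try:
        with socket.create_connection((parsed.hostname, parsed.port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services=SERVICES, probe=port_is_open):
    """Map each service name to True (reachable) or False (not reachable)."""
    return {name: probe(url) for name, url in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{'UP  ' if up else 'DOWN'} {name}")
```

The `probe` parameter exists so the check can be exercised without any services running.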

Default accounts

These are created automatically on first run:

Email	Password	Role
admin@achilles.com	Admin@123	Admin
demo@achilles.com	Demo@123	User

Project Overview

Financial document management and NLP analysis system with:

React client for upload, document browsing, and NLP UI
Node/Express API for auth, document storage, and processing orchestration
Python NLP microservice for extraction and analysis
Local runtime dependencies for Java/Tika/Tesseract OCR
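
As a rough illustration of the extraction step in this pipeline, the sketch below sends a PDF to a running Tika server and reads back plain text. It assumes the Tika server from the table above is listening on port 9998 and uses Tika Server's `PUT /tika` endpoint. How the project's own Node server talks to Tika is not shown in this README, so treat this as an independent example, not the project's code.

```python
import urllib.request

# Assumed to match the Tika URL listed in "Running the app".
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes, tika_url=TIKA_URL):
    """Build a PUT request asking Tika to return the document as plain text."""
    return urllib.request.Request(
        tika_url,
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
    )


def extract_text(pdf_path, tika_url=TIKA_URL):
    """Send the file at pdf_path to Tika and return the extracted text."""
    with open(pdf_path, "rb") as handle:
        request = build_tika_request(handle.read(), tika_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `extract_text("statement.pdf")` (a hypothetical file name) requires Tika to be running, for example via npm start.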

Tidy File Map (Readable)

This section focuses on maintained source/config files. Large generated/runtime/vendor folders are summarized at the end for readability.

Root

package.json: Workspace-level scripts/dependencies.
package-lock.json: Workspace dependency lock file.
README.md: Project documentation.

scripts

scripts/checkSetup.js: Environment/setup validation helper.
scripts/setup.js: Local setup bootstrap script.
scripts/startTika.js: Starts Apache Tika runtime.

nlp_service

nlp_service/main.py: Python NLP microservice entrypoint.
nlp_service/requirements.txt: Python package requirements.

server

server/.env.example: Example environment variables.
server/package.json: Server scripts/dependencies.
server/package-lock.json: Server dependency lock file.
server/server.js: Express server bootstrap and route mounting.
server/createAdmin.js: Creates default admin user.
server/pdf-diagnostic.js: PDF diagnostics utility.
server/tika-config.xml: Tika OCR configuration.
server/contracts/nlpResults.json: NLP result contract/schema.

server/models

server/models/User.js: User model, auth helpers, password hashing.
server/models/Document.js: Document model and NLP-related fields.

server/middleware

server/middleware/auth.js: JWT authentication middleware.

server/routes

server/routes/auth.js: Authentication endpoints.
server/routes/users.js: Admin/user management endpoints.
server/routes/documents.js: Document route aggregator.
server/routes/documents/documentCrudRoutes.js: Document CRUD endpoints.
server/routes/documents/uploadRoutes.js: Upload endpoints.
server/routes/documents/nlpRoutes.js: NLP endpoints.
server/routes/documents/nlpProcessing.js: NLP processing flow helpers.
server/routes/documents/helpers.js: Shared document-route utilities.

server/services

server/services/nlpProcessor.js: Core NLP extraction/processing logic.
server/services/nlpMicroservice.js: Integration with Python NLP microservice.

server/tests

server/tests/auth.test.js: Authentication tests.
server/tests/users.test.js: User/admin route tests.
server/tests/documents.test.js: Document route tests.
server/tests/auditFlags.test.js: Audit flag behavior tests.
server/tests/nerAccuracy.test.js: NER quality/accuracy tests.

client

client/package.json: Client scripts/dependencies.
client/package-lock.json: Client dependency lock file.

client/public

client/public/index.html: HTML entry page.
client/public/manifest.json: PWA metadata.
client/public/images/logo.png: App logo asset.
client/public/images/Outlook-ixaxuupp.jpg: UI image asset.

client/src

client/src/index.js: React app bootstrap.
client/src/App.js: Route setup and top-level app layout.
client/src/config.js: API URL config.
client/src/index.css: Global styling.
client/src/App.css: App-level styling.
client/src/reportWebVitals.js: Web vitals helper.
client/src/setupTests.js: Test setup.

client/src/utils

client/src/utils/alertUtils.js: Alert formatting/helpers.
client/src/utils/documentUtils.js: Document format/download/group helpers.
client/src/utils/fileUtils.js: File validation/upload helpers.

Large Runtime/Generated Areas (Summarized)

These exist in the repo but are intentionally not expanded line-by-line here to keep this README readable:

client/build: Built frontend artifacts and source maps.
server/uploads: Uploaded document files.
server/lib: JAR dependencies for OCR/image support.
runtimes: Bundled Java/Tika/Tesseract runtime binaries and docs.
nlp_service/venv: Python virtual environment and installed packages.
server/database.sqlite: Local SQLite database file.

Reference Material

Tokenization Concepts
- Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
Stopword Removal
- Common English stopwords list based on NLTK (Natural Language Toolkit)
- Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O'Reilly Media.
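
The two concepts above can be sketched in a few lines. This is a minimal illustration, not the code in nlp_service/main.py: the tokenizer is a simple regular expression, and the stopword set is a small hand-picked sample instead of the full NLTK list.

```python
import re

# A small illustrative sample; the NLTK English list is considerably longer.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "to", "was", "were", "with",
})


def tokenize(text):
    """Lowercase the text and split it into word and number tokens.

    Internal '.', "'" and '-' are kept so that '4.2' stays one token.
    """
    return re.findall(r"[a-z0-9]+(?:[.'-][a-z0-9]+)*", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set, preserving order."""
    return [token for token in tokens if token not in stopwords]


if __name__ == "__main__":
    sentence = "The net profit of the company was 4.2 million."
    print(remove_stopwords(tokenize(sentence)))
```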

PDF Processing

pdf-parse Library
- GitHub: https://github.com/modesty/pdf-parse
- Uses Mozilla's PDF.js for parsing

Tika, OCR, and Tesseract

Apache Tika
- Download: https://tika.apache.org/download.html
- Server: https://cwiki.apache.org/confluence/display/TIKA/TikaServer
Tesseract OCR (Windows builds)
- Downloads: https://github.com/UB-Mannheim/tesseract/wiki
Tesseract OCR (Official)
- Project: https://github.com/tesseract-ocr/tesseract

JPEG2000 OCR Support (Scanned PDFs)

Some scanned PDFs use JPEG2000 (JP2) images. To OCR these, add the JAI Image I/O JARs:

1. Download:
   - jai-imageio-core-*.jar
   - jai-imageio-jpeg2000-*.jar
2. Place both files in server/lib
3. Restart Tika (npm run tika)
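
Before restarting Tika, it can be worth confirming that both JARs actually landed in server/lib. The following sketch checks for the two file name patterns listed above; the function name is hypothetical and the path is relative to the repository root.

```python
from pathlib import Path

# File name patterns from the download step above.
REQUIRED_JAR_PATTERNS = ("jai-imageio-core-*.jar", "jai-imageio-jpeg2000-*.jar")


def missing_jars(lib_dir, patterns=REQUIRED_JAR_PATTERNS):
    """Return the patterns that have no matching file in lib_dir."""
    lib = Path(lib_dir)
    if not lib.is_dir():
        return list(patterns)
    return [pattern for pattern in patterns if not any(lib.glob(pattern))]


if __name__ == "__main__":
    missing = missing_jars("server/lib")
    if missing:
        print("Missing JARs:", ", ".join(missing))
    else:
        print("Both JAI Image I/O JARs found.")
```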

React (W3Schools): https://www.w3schools.com/react/
SQL (W3Schools): https://www.w3schools.com/sql/
Node.js (GeeksforGeeks): https://www.geeksforgeeks.org/nodejs/
Express.js (GeeksforGeeks): https://www.geeksforgeeks.org/express-js/
JWT (GeeksforGeeks): https://www.geeksforgeeks.org/json-web-token-jwt/
bcrypt (GeeksforGeeks): https://www.geeksforgeeks.org/bcrypt-hashing-in-nodejs/

Web Development Frameworks

React Documentation
- Official Docs: https://react.dev/
- React Hooks: https://react.dev/reference/react
Express.js - Web framework for Node.js
- Official Guide: https://expressjs.com/
Sequelize ORM
- Documentation: https://sequelize.org/docs/v6/

UI/UX Design

Bootstrap 5
- Documentation: https://getbootstrap.com/docs/5.0/
- Icons: https://icons.getbootstrap.com/
Component-Based Architecture
- Fowler, M. (2003). Patterns of Enterprise Application Architecture. Addison-Wesley.

Authentication & Security

JSON Web Tokens (JWT)
- jwt.io: https://jwt.io/introduction
bcrypt - Password hashing
- GitHub: https://github.com/kelektiv/node.bcrypt.js
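
To make the token format concrete, here is a minimal standard-library sketch of how an HS256 JWT is signed and verified. It is illustrative only: the project's server is Node/Express and its actual JWT library, claims, and secret handling are not shown in this README, so the function names and claim values below are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data):
    """Base64url-encode bytes without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text):
    """Decode base64url text, restoring the stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _signature(signing_input, secret):
    """HMAC-SHA256 over 'header.payload', base64url-encoded."""
    digest = hmac.new(secret.encode("utf-8"), signing_input.encode("utf-8"),
                      hashlib.sha256).digest()
    return _b64url(digest)


def sign_token(payload, secret):
    """Return a compact JWT: base64url(header).base64url(payload).signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    parts = [_b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
             for part in (header, payload)]
    signing_input = ".".join(parts)
    return f"{signing_input}.{_signature(signing_input, secret)}"


def verify_token(token, secret, now=None):
    """Return the payload if the signature is valid and unexpired, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        return None
    expected = _signature(f"{header_b64}.{payload_b64}", secret)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode("utf-8"), signature_b64.encode("utf-8")):
        return None
    payload = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if "exp" in payload and current >= payload["exp"]:
        return None
    return payload
```

In practice a maintained JWT library is preferable to hand-rolled code; the sketch is only meant to show what the three dot-separated parts contain.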

File Upload Handling

Multer - Node.js middleware for multipart/form-data
- Documentation: https://github.com/expressjs/multer

Data Retention & Scheduling

node-cron - Scheduled jobs in Node.js
- Documentation: https://www.npmjs.com/package/node-cron

File Upload Security (Validation & Sanitization)

OWASP File Upload Cheat Sheet
- Guidance: https://cheatsheetseries.owasp.org/cheatsheets/File_Upload_Cheat_Sheet.html
file-type - Detect file signature (magic bytes)
- Documentation: https://www.npmjs.com/package/file-type
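
The two ideas in this section, checking the file signature and sanitizing the file name, can be sketched as follows. This is a simplified illustration and not the project's upload code: it only recognises PDFs by their leading `%PDF-` bytes, and the function names and fallback name `upload` are hypothetical.

```python
import re
from pathlib import PureWindowsPath

# PDF files begin with this signature ("magic bytes").
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data):
    """True if the content starts with the PDF signature, whatever the extension says."""
    return data.startswith(PDF_MAGIC)


def sanitize_filename(name):
    """Strip directory components and replace unusual characters.

    PureWindowsPath treats both '/' and '\\' as separators, so path
    traversal attempts in either style are reduced to the final component.
    """
    base = PureWindowsPath(name).name
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", base).lstrip(".")
    return cleaned or "upload"
```

For example, `sanitize_filename("../../etc/passwd")` yields `passwd`, and a renamed executable fails `looks_like_pdf` even if its name ends in `.pdf`.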
