A parallel and distributed implementation of the music recommendation system, designed to efficiently process large datasets and provide fast, scalable recommendations using optimized parallel algorithms.

Horicuz/ArticlesAgregator

Articles Aggregator – Parallel JSON Processing

A high‑performance parallel pipeline for processing large collections of JSON news articles. The program performs multi‑stage processing including parsing, reduction, aggregation, and structured output generation. The design uses parallelism only where it provides measurable improvements.


🚀 Features

✓ Parallel Work-Stealing Parsing

JSON files are placed into a LinkedBlockingQueue, and each worker thread repeatedly pulls and parses the next available file:

// poll() does not block: it returns null once the queue is empty, which ends the loop
while ((file = queue.poll()) != null) {
    parseSingleFile(file);
}

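Because every thread pulls from the same shared queue, the load balances dynamically: a thread that finishes a small file immediately takes the next one, so no thread sits idle while others work through larger files.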
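The loop above can be sketched end to end as follows. This is a minimal illustration, not the project's actual code: the class name `QueueParseSketch`, the `parseAll` helper, the use of `String` paths, and the counting stand-in for `parseSingleFile` are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

final class QueueParseSketch {
    // Hypothetical stand-in for the real per-file parser: it only counts the file.
    static void parseSingleFile(String file, AtomicInteger parsed) {
        parsed.incrementAndGet();
    }

    // Starts `threads` workers that drain one shared queue; returns the number of files parsed.
    static int parseAll(List<String> files, int threads) throws InterruptedException {
        Queue<String> queue = new LinkedBlockingQueue<>(files);
        AtomicInteger parsed = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                String file;
                // Each worker keeps taking the next available file until none are left.
                while ((file = queue.poll()) != null) {
                    parseSingleFile(file, parsed);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return parsed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> files = Arrays.asList("a.json", "b.json", "c.json");
        System.out.println(parseAll(files, 2)); // prints 3
    }
}
```

The queue is filled before any worker starts, so an empty queue reliably means "no work left" and the non-blocking `poll()` is sufficient.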

✓ Deterministic Serial Reduction

After the parsing phase ends, thread 0 merges the per-thread maps and article lists into the global structures. The merge is cheap compared to parsing, and running it on a single thread in a fixed order keeps the output deterministic.
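A merge of this kind might look like the sketch below. It assumes each thread has produced a local `Map<String, Integer>` of counts; the class name `SerialMergeSketch`, the map type, and the sample keys are illustrative and not taken from the project.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SerialMergeSketch {
    // Merges per-thread counters in thread-index order; meant to run on a single thread.
    static Map<String, Integer> merge(List<Map<String, Integer>> perThread) {
        // Assumption: a sorted map is used here so iteration order is stable across runs.
        Map<String, Integer> global = new TreeMap<>();
        for (Map<String, Integer> local : perThread) {
            for (Map.Entry<String, Integer> e : local.entrySet()) {
                global.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return global;
    }

    public static void main(String[] args) {
        Map<String, Integer> t0 = new HashMap<>();
        t0.put("english", 2);
        Map<String, Integer> t1 = new HashMap<>();
        t1.put("english", 1);
        t1.put("french", 4);
        List<Map<String, Integer>> perThread = new ArrayList<>();
        perThread.add(t0);
        perThread.add(t1);
        System.out.println(merge(perThread)); // prints {english=3, french=4}
    }
}
```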

✓ Parallel Aggregation

Each thread processes a disjoint segment of the global list of articles:

  • uniqueness check (uuid + title)
  • language classification
  • category normalization
  • keyword extraction, using ThreadLocal objects to avoid repeated allocations
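One common way to split a list into disjoint segments is to derive each thread's index range from its id. The sketch below assumes that formula; the project may partition differently, and `SegmentSketch`, `start`, and `end` are hypothetical names.

```java
final class SegmentSketch {
    // First index (inclusive) of the slice owned by thread `id` out of `threads`.
    static int start(int id, int n, int threads) {
        return (int) ((long) id * n / threads);
    }

    // Last index (exclusive) of the slice owned by thread `id` out of `threads`.
    static int end(int id, int n, int threads) {
        return (int) ((long) (id + 1) * n / threads);
    }

    public static void main(String[] args) {
        int n = 10;
        int threads = 4;
        // Prints the ranges [0, 2), [2, 5), [5, 7), [7, 10): every index appears exactly once.
        for (int id = 0; id < threads; id++) {
            System.out.println(id + ": [" + start(id, n, threads) + ", " + end(id, n, threads) + ")");
        }
    }
}
```

Since the ranges never overlap, each thread can read its own slice of the article list without locking.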

✓ Fast Serial Output Generation (I/O-bound)

Output files are written serially:

  • all_articles.txt
  • <language>.txt
  • <category>.txt
  • keywords_count.txt
  • reports.txt

Serial writing avoids filesystem thrashing and gives the most stable performance.
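As a rough illustration of serial output, the sketch below writes one file at a time from the calling thread through a buffered writer. The class `SerialWriteSketch`, the `writeAll` helper, and the placeholder line contents are assumptions; the real line format of each output file is not specified here.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SerialWriteSketch {
    // Writes each entry (file name -> lines) sequentially, one open file at a time.
    static void writeAll(Path dir, Map<String, List<String>> outputs) throws IOException {
        for (Map.Entry<String, List<String>> e : outputs.entrySet()) {
            Path target = dir.resolve(e.getKey());
            try (BufferedWriter w = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
                for (String line : e.getValue()) {
                    w.write(line);
                    w.newLine();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("aggregator-out");
        Map<String, List<String>> outputs = new LinkedHashMap<>();
        // Placeholder contents; the real files hold the aggregated results.
        outputs.put("all_articles.txt", Arrays.asList("placeholder line 1", "placeholder line 2"));
        outputs.put("reports.txt", Arrays.asList("placeholder report"));
        writeAll(dir, outputs);
        System.out.println(Files.readAllLines(dir.resolve("all_articles.txt")).size()); // prints 2
    }
}
```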


📁 Project Structure

src/
   Articles/            # input JSON files
   files/               # languages.txt, categories.txt, linking words
   test/                # test/articles.txt + test/inputs.txt
   *.java               # source code
   Makefile             # build & run
lib/
   jackson-core-2.15.2.jar

🔧 Build & Run

Compile:

cd src
make clean
make build

Run:

make run ARGS="<threads> <articles.txt> <inputs.txt>"

Example:

make run ARGS="4 test/articles.txt test/inputs.txt"

📝 Input Format

articles.txt

The file must start with the number of JSON article files, followed by paths relative to src/:

1101
Articles/article_1.json
Articles/article_2.json
...

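inputs.txt

The file uses the same layout: the number of auxiliary files, followed by their paths relative to src/: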

3
files/languages.txt
files/categories.txt
files/english_linking_words.txt
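Both samples above follow a count-then-paths layout, which could be read with a small helper like the one below. `ListFileSketch` and `parse` are hypothetical names, and the error handling shown is an assumption rather than the program's actual behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ListFileSketch {
    // Parses lines of the form: <count>, followed by at least <count> paths.
    static List<String> parse(List<String> lines) {
        int count = Integer.parseInt(lines.get(0).trim());
        if (lines.size() < count + 1) {
            throw new IllegalArgumentException(
                    "expected " + count + " paths, found " + (lines.size() - 1));
        }
        List<String> paths = new ArrayList<>(count);
        for (int i = 1; i <= count; i++) {
            paths.add(lines.get(i).trim());
        }
        return paths;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "3",
                "files/languages.txt",
                "files/categories.txt",
                "files/english_linking_words.txt");
        System.out.println(parse(lines).size()); // prints 3
    }
}
```

In practice the lines would come from something like `Files.readAllLines(Paths.get("test/inputs.txt"))`, run from the src/ directory so the relative paths resolve.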

🧪 Local Testing

You can benchmark manually using:

time make run ARGS="1 test/articles.txt test/inputs.txt"
time make run ARGS="2 test/articles.txt test/inputs.txt"
time make run ARGS="4 test/articles.txt test/inputs.txt"

or using a small script:

for p in 1 2 4; do
  echo "Threads: $p"
  time make run ARGS="$p test/articles.txt test/inputs.txt"
done

🔍 Checker Usage

Alternatively, run the checker:

bash checker/checker.sh test_1

The Makefile must provide:

  • a build target
  • a run target that accepts ARGS="p articles inputs"

Output files must be generated in the current working directory.


⚙️ Parallel Architecture Summary

PARSE (parallel, work-stealing)
      ↓ barrier
REDUCE PARSE (serial)
      ↓ barrier
AGGREGATION (parallel)
      ↓ barrier
REDUCE AGG (serial)
      ↓ barrier
WRITE OUTPUT (serial)

This approach provides:

  • high speedup on compute-bound phases
  • deterministic output
  • minimal memory churn
  • stable I/O behavior through serial writes
  • full CPU occupancy during parse and aggregation
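The alternation of parallel and serial phases can be sketched with a `CyclicBarrier`, shown here on a toy sum instead of article data. This is an assumed structure for illustration: `PhaseSketch`, `run`, and the array-based state are not taken from the project, which may synchronize differently.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

final class PhaseSketch {
    // Sums `data` using a parallel phase, a barrier, and a serial reduce on thread 0.
    static long run(int threads, int[] data) throws InterruptedException {
        CyclicBarrier barrier = new CyclicBarrier(threads);
        long[] partial = new long[threads]; // one slot per thread, never shared
        long[] total = new long[1];         // written only by thread 0
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                try {
                    // Parallel phase: each thread handles its own slice.
                    int from = (int) ((long) id * data.length / threads);
                    int to = (int) ((long) (id + 1) * data.length / threads);
                    for (int i = from; i < to; i++) {
                        partial[id] += data[i];
                    }
                    barrier.await();
                    // Serial phase: only thread 0 touches the shared result.
                    if (id == 0) {
                        for (long p : partial) {
                            total[0] += p;
                        }
                    }
                    barrier.await();
                    // Past this barrier, every thread could safely read total[0].
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return total[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(4, new int[] {1, 2, 3, 4, 5, 6, 7, 8})); // prints 36
    }
}
```

The same threads live through all phases, so nothing is re-created between stages; the barrier is the only synchronization point.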

🧹 Implementation Notes

  • no artificial delays (sleep, busy waiting)
  • no repeated thread creation
  • ThreadLocal buffers for tokenization and deduplication
  • Jackson Streaming API for fast JSON parsing
  • dynamic load balancing during parse
  • only serial access to shared global structures
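To illustrate the ThreadLocal buffer idea, the sketch below reuses one `StringBuilder` per thread while tokenizing. The tokenization rule (lowercase runs of letters) and the names `ThreadLocalBufferSketch` and `tokenize` are assumptions for the example, not the project's actual keyword logic.

```java
import java.util.ArrayList;
import java.util.List;

final class ThreadLocalBufferSketch {
    // One reusable StringBuilder per thread, created lazily on first use.
    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(() -> new StringBuilder(64));

    // Splits text into lowercase runs of letters, reusing the calling thread's buffer.
    static List<String> tokenize(String text) {
        StringBuilder sb = BUFFER.get();
        sb.setLength(0); // reset without releasing the underlying array
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                sb.append(Character.toLowerCase(c));
            } else if (sb.length() > 0) {
                tokens.add(sb.toString());
                sb.setLength(0);
            }
        }
        if (sb.length() > 0) {
            tokens.add(sb.toString());
            sb.setLength(0);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Parallel JSON, parsed fast!")); // prints [parallel, json, parsed, fast]
    }
}
```

Each thread gets its own buffer, so no locking is needed and the builder's backing array is reused across calls instead of being reallocated per article.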

🔗 Repository

https://github.com/Horicuz/ArticlesAgregator

📌 Short Description (for GitHub)

Parallel JSON news article processor using work-stealing parsing, parallel aggregation, and deterministic serial reduction (Java 8).

© License

Open-source, educational/demo parallel processing pipeline.
