A high‑performance parallel pipeline for processing large collections of JSON news articles. The program performs multi‑stage processing including parsing, reduction, aggregation, and structured output generation. The design uses parallelism only where it provides measurable improvements.
JSON files are placed into a `LinkedBlockingQueue`, and each worker thread repeatedly pulls and parses the next available file:

```java
File file;
while ((file = queue.poll()) != null) {
    parseSingleFile(file);
}
```

This ensures dynamic load balancing and prevents idle threads when some files are larger than others.
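Expanded into a minimal self-contained sketch (the thread handling and names such as `parseAll` and `parseSingleFile` are illustrative assumptions, not the project's exact code):

```java
import java.io.File;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

public class ParsePhase {

    public static void parseAll(List<File> files, int numThreads) throws InterruptedException {
        // Fill the shared queue up front; poll() returns null once it is drained.
        LinkedBlockingQueue<File> queue = new LinkedBlockingQueue<>(files);

        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            workers[i] = new Thread(() -> {
                File file;
                // Each worker grabs the next file as soon as it finishes the previous one,
                // so a few large files do not leave the other threads idle.
                while ((file = queue.poll()) != null) {
                    parseSingleFile(file);
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            w.join(); // acts as the barrier before the serial reduce phase
        }
    }

    private static void parseSingleFile(File file) { /* Jackson streaming parse */ }
}
```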
After the parsing phase ends, the per-thread maps and article lists are merged by thread 0. Because it only combines already-computed results, this merge is cheap compared to parsing, and running it on a single thread keeps the output deterministic.
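A hedged sketch of such a serial reduce, assuming each thread accumulated a private keyword-count map during parsing (the data layout is illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceParse {
    // Runs on thread 0 only, after all workers have joined: no locks are needed,
    // and the fixed merge order (thread 0, 1, 2, ...) keeps the result deterministic.
    static Map<String, Integer> mergeCounts(List<Map<String, Integer>> perThread) {
        Map<String, Integer> global = new HashMap<>();
        for (Map<String, Integer> local : perThread) {
            for (Map.Entry<String, Integer> e : local.entrySet()) {
                global.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return global;
    }
}
```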
Each thread processes a disjoint segment of the global list of articles (see the partitioning sketch after this list):
- uniqueness check (`uuid` + `title`)
- language classification
- normalized categories
- keyword extraction using `ThreadLocal` objects to avoid allocations
Output files are written serially:
- `all_articles.txt`
- `<language>.txt`
- `<category>.txt`
- `keywords_count.txt`
- `reports.txt`
Serial writing avoids filesystem thrashing and gives the most stable performance.
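A minimal sketch of one such serial write, assuming the aggregated keyword counts have already been merged (the map layout is an assumption):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

public class OutputPhase {
    static void writeKeywordCounts(Map<String, Integer> counts) throws IOException {
        // Single thread, one file at a time: a simple sequential access pattern for the disk.
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("keywords_count.txt"))) {
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                out.write(e.getKey() + " " + e.getValue());
                out.newLine();
            }
        }
    }
}
```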
```
src/
  Articles/                  # input JSON files
  files/                     # languages.txt, categories.txt, linking words
  test/                      # test/articles.txt + test/inputs.txt
  *.java                     # source code
  Makefile                   # build & run
lib/
  jackson-core-2.15.2.jar
```
```
cd src
make clean
make build
make run ARGS="<threads> <articles.txt> <inputs.txt>"
```

Example:

```
make run ARGS="4 test/articles.txt test/inputs.txt"
```

The input file must start with the number of JSON article files, followed by paths relative to `src/`:
```
1101
Articles/article_1.json
Articles/article_2.json
...
3
files/languages.txt
files/categories.txt
files/english_linking_words.txt
```
You can benchmark manually using:
time make run ARGS="1 test/articles.txt test/inputs.txt"
time make run ARGS="2 test/articles.txt test/inputs.txt"
time make run ARGS="4 test/articles.txt test/inputs.txt"or using a small script:
for p in 1 2 4; do
echo "Threads: $p"
make run ARGS="$p test/articles.txt test/inputs.txt"
doneOr you could use the cheker:
bash checker/checker.sh test_1The Makefile must provide:
- a
buildtarget - a
runtarget that acceptsARGS="p articles inputs"
Output files must be generated in the current working directory.
```
PARSE (parallel, work-stealing)
        ↓ barrier
REDUCE PARSE (serial)
        ↓ barrier
AGGREGATION (parallel)
        ↓ barrier
REDUCE AGG (serial)
        ↓ barrier
WRITE OUTPUT (serial)
```
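One way to realize these barriers is with a `CyclicBarrier`, sketched below; whether the project uses `CyclicBarrier`, plain `join()`, or another primitive is not stated above, so treat this as an assumption:

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class PipelineWorker implements Runnable {
    private final int threadId;
    private final CyclicBarrier barrier; // created with new CyclicBarrier(numThreads)

    PipelineWorker(int threadId, CyclicBarrier barrier) {
        this.threadId = threadId;
        this.barrier = barrier;
    }

    @Override
    public void run() {
        try {
            parse();                          // parallel: pull files from the shared queue
            barrier.await();                  // barrier
            if (threadId == 0) reduceParse(); // serial: thread 0 merges per-thread results
            barrier.await();                  // barrier
            aggregate();                      // parallel: disjoint segments of the article list
            barrier.await();                  // barrier
            if (threadId == 0) {
                reduceAgg();                  // serial merge of aggregation results
                writeOutput();                // serial writes
            }
        } catch (InterruptedException | BrokenBarrierException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void parse() { /* ... */ }
    private void reduceParse() { /* ... */ }
    private void aggregate() { /* ... */ }
    private void reduceAgg() { /* ... */ }
    private void writeOutput() { /* ... */ }
}
```

Starting `numThreads` threads over `PipelineWorker` instances sharing one barrier walks all five stages with a single thread creation per worker.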
This approach provides:
- high speedup on compute-bound phases
- deterministic output
- minimal memory churn
- stable I/O behavior through serial writes
- full CPU occupancy during parse and aggregation
- no artificial delays (`sleep`, busy waiting)
- no repeated thread creation
- `ThreadLocal` buffers for tokenization and deduplication
- Jackson Streaming API for fast JSON parsing (see the sketch after this list)
- dynamic load balancing during parse
- only serial access to shared global structures
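Combining the `ThreadLocal` and Jackson points, a sketch of a streaming parse that reuses a per-thread buffer; the `uuid`/`title` field names follow the uniqueness check above, and the rest is an illustrative assumption:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.File;
import java.io.IOException;

public class StreamingParse {
    private static final JsonFactory FACTORY = new JsonFactory();

    // Reused per thread instead of allocating a fresh builder for every article.
    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(StringBuilder::new);

    static void parseSingleFile(File file) throws IOException {
        try (JsonParser p = FACTORY.createParser(file)) {
            while (p.nextToken() != null) {
                if (p.getCurrentToken() == JsonToken.FIELD_NAME) {
                    String field = p.getCurrentName();
                    if ("uuid".equals(field) || "title".equals(field)) {
                        p.nextToken();            // advance to the field's value
                        StringBuilder buf = BUFFER.get();
                        buf.setLength(0);         // reset without reallocating
                        buf.append(p.getText());
                        // ... hand buf off to the dedup / keyword logic ...
                    }
                }
            }
        }
    }
}
```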
Repository: https://github.com/Horicuz/ArticlesAgregator

Parallel JSON news article processor using work-stealing parsing, parallel aggregation, and deterministic serial reduction (Java 8). Open-source, educational/demo parallel processing pipeline.