A high‑performance parallel pipeline for processing large collections of JSON news articles. The program performs multi‑stage processing including parsing, reduction, aggregation, and structured output generation. The design uses parallelism only where it provides measurable improvements.
JSON files are placed into a `LinkedBlockingQueue`, and each worker thread repeatedly pulls and parses the next available file:

```java
File file;
while ((file = queue.poll()) != null) {
    parseSingleFile(file);
}
```

This ensures dynamic load balancing and prevents idle threads when some files are larger than others.
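Expanded into a minimal self-contained sketch (the thread handling and names such as `parseAll` and `parseSingleFile` are illustrative assumptions, not the project's exact code):

```java
import java.io.File;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

public class ParsePhase {

    public static void parseAll(List<File> files, int numThreads) throws InterruptedException {
        // Fill the shared queue up front; poll() returns null once it is drained.
        LinkedBlockingQueue<File> queue = new LinkedBlockingQueue<>(files);

        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            workers[i] = new Thread(() -> {
                File file;
                // Each worker grabs the next file as soon as it finishes the previous one,
                // so a few large files do not leave the other threads idle.
                while ((file = queue.poll()) != null) {
                    parseSingleFile(file);
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            w.join(); // acts as the barrier before the serial reduce phase
        }
    }

    private static void parseSingleFile(File file) { /* Jackson streaming parse */ }
}
```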
After the parsing phase ends, the per-thread maps and article lists are merged by thread 0. Because it only combines already-computed results, this merge is cheap compared to parsing, and running it on a single thread keeps the output deterministic.
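A hedged sketch of such a serial reduce, assuming each thread accumulated a private keyword-count map during parsing (the data layout is illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceParse {
    // Runs on thread 0 only, after all workers have joined: no locks are needed,
    // and the fixed merge order (thread 0, 1, 2, ...) keeps the result deterministic.
    static Map<String, Integer> mergeCounts(List<Map<String, Integer>> perThread) {
        Map<String, Integer> global = new HashMap<>();
        for (Map<String, Integer> local : perThread) {
            for (Map.Entry<String, Integer> e : local.entrySet()) {
                global.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return global;
    }
}
```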
Each thread processes a disjoint segment of the global list of articles (see the partitioning sketch after this list):
- uniqueness check (`uuid` + `title`)
- language classification
- normalized categories
- keyword extraction using `ThreadLocal` objects to avoid allocations
Output files are written serially:
- `all_articles.txt`
- `<language>.txt`
- `<category>.txt`
- `keywords_count.txt`
- `reports.txt`
Serial writing avoids filesystem thrashing and gives the most stable performance.
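A minimal sketch of one such serial write, assuming the aggregated keyword counts have already been merged (the map layout is an assumption):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

public class OutputPhase {
    static void writeKeywordCounts(Map<String, Integer> counts) throws IOException {
        // Single thread, one file at a time: a simple sequential access pattern for the disk.
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("keywords_count.txt"))) {
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                out.write(e.getKey() + " " + e.getValue());
                out.newLine();
            }
        }
    }
}
```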
```
src/
  Articles/                  # input JSON files
  files/                     # languages.txt, categories.txt, linking words
  test/                      # test/articles.txt + test/inputs.txt
  *.java                     # source code
  Makefile                   # build & run
lib/
  jackson-core-2.15.2.jar
```
```
cd src
make clean
make build
make run ARGS="<threads> <articles.txt> <inputs.txt>"
```

Example:

```
make run ARGS="4 test/articles.txt test/inputs.txt"
```

The input file must start with the number of JSON article files, followed by paths relative to `src/`:
```
1101
Articles/article_1.json
Articles/article_2.json
...
3
files/languages.txt
files/categories.txt
files/english_linking_words.txt
```
You can benchmark manually using:
time make run ARGS="1 test/articles.txt test/inputs.txt"
time make run ARGS="2 test/articles.txt test/inputs.txt"
time make run ARGS="4 test/articles.txt test/inputs.txt"or using a small script:
for p in 1 2 4; do
echo "Threads: $p"
make run ARGS="$p test/articles.txt test/inputs.txt"
doneOr you could use the cheker:
bash checker/checker.sh test_1The Makefile must provide:
- a
buildtarget - a
runtarget that acceptsARGS="p articles inputs"
Output files must be generated in the current working directory.
```
PARSE (parallel, work-stealing)
        ↓ barrier
REDUCE PARSE (serial)
        ↓ barrier
AGGREGATION (parallel)
        ↓ barrier
REDUCE AGG (serial)
        ↓ barrier
WRITE OUTPUT (serial)
```
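One way to realize these barriers is with a `CyclicBarrier`, sketched below; whether the project uses `CyclicBarrier`, plain `join()`, or another primitive is not stated above, so treat this as an assumption:

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class PipelineWorker implements Runnable {
    private final int threadId;
    private final CyclicBarrier barrier; // created with new CyclicBarrier(numThreads)

    PipelineWorker(int threadId, CyclicBarrier barrier) {
        this.threadId = threadId;
        this.barrier = barrier;
    }

    @Override
    public void run() {
        try {
            parse();                          // parallel: pull files from the shared queue
            barrier.await();                  // barrier
            if (threadId == 0) reduceParse(); // serial: thread 0 merges per-thread results
            barrier.await();                  // barrier
            aggregate();                      // parallel: disjoint segments of the article list
            barrier.await();                  // barrier
            if (threadId == 0) {
                reduceAgg();                  // serial merge of aggregation results
                writeOutput();                // serial writes
            }
        } catch (InterruptedException | BrokenBarrierException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void parse() { /* ... */ }
    private void reduceParse() { /* ... */ }
    private void aggregate() { /* ... */ }
    private void reduceAgg() { /* ... */ }
    private void writeOutput() { /* ... */ }
}
```

Starting `numThreads` threads over `PipelineWorker` instances sharing one barrier walks all five stages with a single thread creation per worker.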
This approach provides:
- high speedup on compute-bound phases
- deterministic output
- minimal memory churn
- stable I/O behavior through serial writes
- full CPU occupancy during parse and aggregation
- no artificial delays (`sleep`, busy waiting)
- no repeated thread creation
- `ThreadLocal` buffers for tokenization and deduplication
- Jackson Streaming API for fast JSON parsing (see the sketch after this list)
- dynamic load balancing during parse
- only serial access to shared global structures
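Combining the `ThreadLocal` and Jackson points, a sketch of a streaming parse that reuses a per-thread buffer; the `uuid`/`title` field names follow the uniqueness check above, and the rest is an illustrative assumption:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.File;
import java.io.IOException;

public class StreamingParse {
    private static final JsonFactory FACTORY = new JsonFactory();

    // Reused per thread instead of allocating a fresh builder for every article.
    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(StringBuilder::new);

    static void parseSingleFile(File file) throws IOException {
        try (JsonParser p = FACTORY.createParser(file)) {
            while (p.nextToken() != null) {
                if (p.getCurrentToken() == JsonToken.FIELD_NAME) {
                    String field = p.getCurrentName();
                    if ("uuid".equals(field) || "title".equals(field)) {
                        p.nextToken();            // advance to the field's value
                        StringBuilder buf = BUFFER.get();
                        buf.setLength(0);         // reset without reallocating
                        buf.append(p.getText());
                        // ... hand buf off to the dedup / keyword logic ...
                    }
                }
            }
        }
    }
}
```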
Repository: https://github.com/Horicuz/ArticlesAgregator

Parallel JSON news article processor using work-stealing parsing, parallel aggregation, and deterministic serial reduction (Java 8). Open-source, educational/demo parallel processing pipeline.