Should I save the data into a file or can we send child data over worker? #19
-
Hi @arthurvanl, great question!

Short answer: for large data (>1-2 MB), save to disk/S3 and pass the file path in the job result. For small data (<1 MB), passing it directly through `getParentResult` is fine.

Why? Job results are stored in memory (LRU cache, max 5,000 entries) and optionally persisted to SQLite. Storing 20MB+ blobs per job would quickly exhaust memory and slow down SQLite operations.

Recommended pattern for large data:

```ts
import { writeFile, readFile } from 'fs/promises';
import { randomUUID } from 'crypto';
const worker = new Worker('fetch-data', async (job) => {
const data = await fetchFromFTP(job.data.endpoint);
if (JSON.stringify(data).length > 1_000_000) {
// Large data -> write to disk, return path
const path = `/tmp/jobs/${randomUUID()}.json`;
await writeFile(path, JSON.stringify(data));
return { type: 'file', path };
}
// Small data -> return directly
return { type: 'inline', data };
}, { embedded: true });
// Child worker reads parent result
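// (assumes `flow` below is a FlowProducer instance created elsewhere)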
const childWorker = new Worker('process-data', async (job) => {
const parentResult = flow.getParentResult(job.data.__flowParentId);
if (parentResult.type === 'file') {
const data = JSON.parse(await readFile(parentResult.path, 'utf-8'));
// process data...
} else {
// use parentResult.data directly
}
}, { embedded: true });
```

For production, consider using S3/MinIO instead of local disk — bunqueue already has S3 backup support, and it scales better across instances. Also note that with v2.5.8, …
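As an illustration, here's a minimal sketch of the S3 variant using the AWS SDK v3 client (the bucket name, endpoint, and both helper functions are assumptions; the same client works against MinIO):

```ts
import { S3Client, PutObjectCommand, GetObjectCommand } from '@aws-sdk/client-s3';
import { randomUUID } from 'crypto';

// Hypothetical bucket/endpoint; credentials come from the environment
const s3 = new S3Client({ endpoint: 'http://localhost:9000', region: 'us-east-1', forcePathStyle: true });
const BUCKET = 'job-results';

// Parent worker: upload the large payload, return only the object key
async function storeLargeResult(data: unknown) {
  const key = `jobs/${randomUUID()}.json`;
  await s3.send(new PutObjectCommand({ Bucket: BUCKET, Key: key, Body: JSON.stringify(data) }));
  return { type: 's3', key } as const;
}

// Child worker: resolve the key back into the data
async function loadLargeResult(key: string) {
  const res = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: key }));
  return JSON.parse(await res.Body!.transformToString());
}
```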
-
A small addition to this @egeominotti: a job like that will be run multiple times in parallel, depending on how many customers there are. Will this be a problem when using only one worker? And if not, what would be the best practice to make this work fast?
-
No, a single worker is not a problem — it depends on how you configure it. A single `Worker` can process many jobs concurrently:

```ts
const worker = new Worker('customer-jobs', async (job) => {
const data = await fetchCustomerData(job.data.customerId);
// process...
}, { concurrency: 10 }); // Processes 10 jobs in parallel
```

With `concurrency: 10`, one Worker instance processes up to 10 jobs at the same time.

### When to scale up

…
### Recommendation for your use case

Since you're fetching data per customer (I/O-bound), start with a single Worker with higher concurrency:

```ts
const worker = new Worker('fetch-data', processor, {
concurrency: 20, // 20 parallel customer fetches
batchSize: 20, // Pull 20 jobs at a time
});
```

If you need more throughput, you can spin up additional Worker instances on the same queue — they'll automatically share the workload. Switch to …
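For instance, a small sketch of that horizontal scaling within one process (`processor` is the same handler as above):

```ts
// Two Worker instances on the same queue; jobs are distributed between them automatically
const workerA = new Worker('fetch-data', processor, { concurrency: 20 });
const workerB = new Worker('fetch-data', processor, { concurrency: 20 });
```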
-
Great follow-up! Let me clarify a few things:

### 1. SandboxedWorker DOES have events

```ts
import { SandboxedWorker } from 'bunqueue/client';
const worker = new SandboxedWorker('process-files', {
processor: './processors/file-processor.ts',
concurrency: 4,
maxMemory: 512, // 512MB per worker thread (default is 256)
timeout: 300000, // 5 min timeout for large files
});
worker.on('active', (job) => {
console.log(`Job ${job.id} started at ${Date.now()}`);
});
worker.on('completed', (job, result) => {
console.log(`Job ${job.id} completed`, result);
});
worker.on('failed', (job, error) => {
console.error(`Job ${job.id} failed: ${error.message}`);
});
worker.on('progress', (job, progress) => {
console.log(`Job ${job.id}: ${progress}%`);
});
await worker.start();
```

### 2. Do you actually need SandboxedWorker for 200-300MB files?

It depends on what you're doing with the files.
If you're just reading from FTP and writing to disk/S3, a regular Worker is better (see the streaming sketch below).
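As an illustration, a minimal sketch of that regular-Worker approach: stream the download straight to disk so the full 200-300MB payload never sits in memory (`fetchFtpStream` is a hypothetical helper returning a Node `Readable`):

```ts
import { createWriteStream } from 'fs';
import { pipeline } from 'stream/promises';

const fileWorker = new Worker('fetch-files', async (job) => {
  // Hypothetical helper: opens a Readable stream for the remote FTP file
  const source = await fetchFtpStream(job.data.remotePath);
  const dest = `/tmp/files/${job.id}.bin`; // assumes /tmp/files exists
  await pipeline(source, createWriteStream(dest)); // stream to disk, never buffer it all
  return { type: 'file', path: dest };
}, { embedded: true });
```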
### 3. upsertJobScheduler + chains: this won't work directly

```ts
import { Queue, Worker, FlowProducer, SandboxedWorker } from 'bunqueue/client';
const queue = new Queue('triggers', { embedded: true });
const flow = new FlowProducer({ embedded: true });
// 1. Schedule a "trigger" job every 60 seconds
await queue.upsertJobScheduler('customer-pipeline', {
every: 60000,
}, {
name: 'trigger',
data: { customerId: 'abc123' },
});
// 2. Worker for the trigger job creates the actual chain
const triggerWorker = new Worker('triggers', async (job) => {
await flow.addChain([
{
name: 'fetch-data',
queueName: 'pipeline',
data: { customerId: job.data.customerId },
},
{
name: 'process-data',
queueName: 'pipeline',
data: { customerId: job.data.customerId },
},
{
name: 'save-results',
queueName: 'pipeline',
data: { customerId: job.data.customerId },
},
]);
return { chainCreated: true };
}, { embedded: true });
// 3. SandboxedWorker (or regular Worker) processes the chain steps
const pipelineWorker = new SandboxedWorker('pipeline', {
processor: './processors/pipeline.ts',
concurrency: 4,
maxMemory: 512,
timeout: 300000,
});
pipelineWorker.on('completed', (job, result) => {
console.log(`Step ${job.name} completed for customer ${job.data.customerId}`);
});
await pipelineWorker.start();
```

### 4. Preventing overlapping chains

There's no built-in mechanism to block new chains while a previous one is running; the scheduler will keep firing regardless. You can handle this in the trigger worker:

```ts
// Track running chains (in-memory, or use a dedicated queue/flag)
const runningChains = new Map<string, boolean>();
const triggerWorker = new Worker('triggers', async (job) => {
const { customerId } = job.data;
// Skip if chain is already running for this customer
if (runningChains.get(customerId)) {
console.log(`Chain already running for ${customerId}, skipping`);
return { skipped: true };
}
runningChains.set(customerId, true);
const result = await flow.addChain([
{ name: 'fetch', queueName: 'pipeline', data: { customerId } },
{ name: 'process', queueName: 'pipeline', data: { customerId } },
{ name: 'save', queueName: 'pipeline', data: { customerId } },
]);
return { chainCreated: true, jobIds: result.jobIds };
}, { embedded: true });
// Clear the flag when the LAST step of the chain completes
pipelineWorker.on('completed', (job) => {
if (job.name === 'save') {
runningChains.delete(job.data.customerId);
}
});
```

This way the scheduler keeps firing, but the trigger worker skips if a chain is still in progress for that customer.
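One caveat in that sketch, assuming the same event API shown above: if a chain step fails, the in-memory flag never clears and that customer stays blocked. Clearing it on `failed` as well covers that path:

```ts
// Also clear the flag when a step fails, so the customer isn't blocked forever
pipelineWorker.on('failed', (job) => {
  runningChains.delete(job.data.customerId);
});
```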
-
It seems …
-
I have another big thing which might be a problem when using a …