
Content Lake App


AI-powered semantic search and RAG over Alfresco and Nuxeo content using hxpr

Features · Quick Start · Architecture · Authentication · API Usage · Configuration

Related Projects

  • alfresco-content-lake-ui - ACA-based frontend for semantic search and RAG over Content Lake.
  • content-lake-app-ui - Demo UI for the Content Lake App that provides dual authentication (Alfresco + Nuxeo)
  • content-lake-app-deployment - Docker Compose deployment for Alfresco, Nuxeo, hxpr, Content Lake services, and the UI.
  • nuxeo-deployment - Companion project that builds and runs the local Nuxeo server. Required when using compose.nuxeo.yaml in this repo.

Documentation

Doc                   Contents
docs/architecture.md  Module layout, SPI interfaces, dependency graph, data model, design decisions
docs/sync-pipeline.md Full/live sync flows, metadata-only path, path structure, idempotency, scope resolution

Overview

Proof of Concept for AI-powered semantic search and Retrieval-Augmented Generation (RAG) over Alfresco and Nuxeo content.

Leverages hxpr as a Content Lake to enable high-quality AI search while:

  • Keeping Alfresco and Nuxeo as the sources of truth
  • Enforcing server-side permissions via ACLs
  • Supporting on-premises AI execution
  • Minimizing data duplication

Features

  • Two-Phase Sync Pipeline: Fast metadata ingestion + async content processing
  • Near Real-Time Sync: Alfresco Event2 listener over ActiveMQ using the Alfresco Java SDK
  • Semantic Search: Vector embeddings with permission-aware kNN search
  • RAG: LLM-powered question answering grounded in Alfresco document content
  • Permission-Aware: Server-side ACL enforcement via hxpr
  • Local AI: On-premises LLM and embedding models using Spring AI
  • Repository Scope Model: cl:indexed and cl:excludeFromLake for Alfresco-native scope control
  • REST API: Generic connector using Alfresco REST APIs
  • Secured Endpoints: Alfresco authentication (username/password or tickets)
  • Shared Ingestion Core: Common metadata, transform, chunking, embedding, ACL, and delete/update logic in content-lake-core
  • Idempotent Coexistence: alfresco_modifiedAt guard prevents stale batch/live writes from overwriting newer content

Architecture

  ┌────────────────────────────────────┐   ┌────────────────────────────────────┐
  │ Alfresco Repository + Event2       │   │ Nuxeo + Audit Stream               │
  │ REST API + ActiveMQ topic          │   │ REST API + audit log watermark     │
  └────────────────────────────────────┘   └────────────────────────────────────┘
         │                   │                    │                   │
         ▼                   ▼                    ▼                   ▼
  ┌────────────┐    ┌──────────────┐    ┌──────────────────┐  ┌──────────────────┐
  │ alfresco-  │    │ alfresco-    │    │ nuxeo-batch-     │  │ nuxeo-live-      │
  │ batch-     │    │ live-        │    │ ingester         │  │ ingester         │
  │ ingester   │    │ ingester     │    │ NXQL Discovery   │  │ Audit Watermark  │
  └────────────┘    └──────────────┘    └──────────────────┘  └──────────────────┘
         │                   │                    │                   │
         └───────────────────┴────────────────────┴───────────────────┘
                                          ▼
                    ┌──────────────────────────────────────────┐
                    │ content-lake-core                        │
                    │ Node sync, Transform, Chunk, Embed, ACL  │
                    │ source_modifiedAt idempotency guard      │
                    └──────────────────────────────────────────┘
                                          ▼
                    ┌──────────────────────────────────────────┐
                    │ hxpr Content Lake                        │
                    └──────────────────────────────────────────┘
                                          ▼
                    ┌──────────────────────────────────────────┐
                    │ rag-service                              │
                    │ Query → Embed → Search → Augment → LLM   │
                    └──────────────────────────────────────────┘

Modules

Module                        Group      Port   Description
content-lake-repo-model       common/    -      Alfresco repository JAR that bootstraps the cl:indexed content model for scope control
content-lake-spi              common/    -      Source Provider Interface: SourceNode, ContentSourceClient, TextExtractor, ScopeResolver
content-lake-core             common/    -      Shared ingestion pipeline: metadata sync, transform, chunking, embedding, ACL updates, idempotency
rag-service                   common/    9091   Semantic search, hybrid search, and RAG question answering
content-lake-source-alfresco  alfresco/  -      Alfresco REST clients, scope resolver, and ACL expansion
alfresco-batch-ingester       alfresco/  9090   Alfresco folder discovery, batch scheduling, and /api/sync/* controllers
alfresco-live-ingester        alfresco/  9092   Alfresco Event2 listener over ActiveMQ using Alfresco Java SDK handlers
content-lake-source-nuxeo     nuxeo/     -      Nuxeo REST clients, scope resolver, auth abstraction, and text extraction
nuxeo-batch-ingester          nuxeo/     9093   Nuxeo full-batch discovery and one-shot sync using NXQL
nuxeo-live-ingester           nuxeo/     9094   Nuxeo audit-stream listener using a persisted watermark

Quick Start

Prerequisites

  • Java 21+ and Maven 3.9+
  • Docker and Docker Compose
  • Alfresco Content Services 25.x+
    • Alfresco Transform Service (for text extraction)
  • hxpr Content Lake (with OAuth2 IDP)
  • Docker Model Runner (for embeddings and LLM)

Installation

# Clone repository
git clone https://github.com/aborroy/content-lake-app.git
cd content-lake-app

# Build all modules
mvn clean package

# Deploy the repository content model to ACS before starting the ingesters
# Artifact:
#   common/content-lake-repo-model/target/content-lake-repo-model-1.0.0-SNAPSHOT.jar
# Deploy it to the Alfresco Repository classpath.

# Configure (see Environment Variables below)
export ALFRESCO_URL=http://localhost:8080
export ALFRESCO_INTERNAL_USERNAME=admin
export ALFRESCO_INTERNAL_PASSWORD=admin
# ... (see full configuration below)

# Run batch ingestion
java -jar alfresco/alfresco-batch-ingester/target/alfresco-batch-ingester-1.0.0-SNAPSHOT.jar

# Run live ingestion
java -jar alfresco/alfresco-live-ingester/target/alfresco-live-ingester-1.0.0-SNAPSHOT.jar

# Run RAG service
java -jar common/rag-service/target/rag-service-1.0.0-SNAPSHOT.jar

# Or with Docker Compose (full stack)
cd ../content-lake-app-deployment && docker compose up --build

Alfresco Repo Model

The batch and live ingesters now rely on an Alfresco content model for scope control:

  • cl:indexed marks a folder subtree as in scope for Content Lake ingestion
  • cl:excludeFromLake opts a file, or an entire folder subtree, out of ingestion even when an ancestor folder carries cl:indexed

Build artifact:

common/content-lake-repo-model/target/content-lake-repo-model-1.0.0-SNAPSHOT.jar

Deploy that JAR to the Alfresco Repository classpath before enabling ingestion. Typical options are:

  • include it in an ACS SDK modules/platform build
  • copy or mount it into an Alfresco Repository image under webapps/alfresco/WEB-INF/lib

Starting From A Non-Indexed Repository

If your Alfresco Repository does not yet use cl:indexed, the recommended startup sequence is:

  1. Build the project and deploy the repository model JAR to Alfresco Repository. After deployment, restart the repository so cl:indexed and cl:excludeFromLake are available.
  2. Start batch-ingester.
  3. Run a batch synchronization against the folder you want to onboard. The ingester automatically adds cl:indexed to each root folder if it is not already present, then performs the initial backfill into Content Lake.
  4. Start live-ingester. Live ingestion then keeps that indexed subtree up to date.

Example for indexing all sites under Company Home/Sites:

  1. Resolve the Alfresco node id for Company Home/Sites. You can obtain it from Alfresco UI tools or the Alfresco REST API.
  2. Run the batch sync against that folder:
curl -X POST http://localhost:9090/api/sync/batch \
  -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{"folders":["SITES_FOLDER_NODE_ID"],"recursive":true,"types":["cm:content"]}'

This single call marks SITES_FOLDER_NODE_ID with cl:indexed (if needed) and ingests all existing content beneath it.

  3. After the batch completes, start live-ingester so new or changed content under Company Home/Sites continues to sync automatically.

Important:

  • cl:indexed can also be set directly via the Alfresco Repository nodes API or the Content Lake UI extension; the batch ingester sets it automatically only for root folders passed in the request
  • cl:excludeFromLake on a folder removes that folder's full subtree from Content Lake scope; batch discovery skips it and live reconciliation deletes previously ingested descendants
  • if you later want to index only one site, pass that site folder to /api/sync/batch instead of Company Home/Sites
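
The scope rules amount to an ancestor walk where the nearest marker wins. A sketch under that reading (the real logic lives behind the ScopeResolver SPI in content-lake-spi; names here are illustrative):

```python
def in_scope(path_aspects: list[set[str]]) -> bool:
    """Decide Content Lake scope for a node.

    path_aspects lists aspect sets from the node itself up to the root,
    e.g. [node, parent, grandparent, ...]. The node is in scope when the
    first marker found walking upward is cl:indexed; cl:excludeFromLake
    on the node or a nearer ancestor wins over an indexed ancestor.
    """
    for aspects in path_aspects:  # walk from node towards root
        if "cl:excludeFromLake" in aspects:
            return False          # exclusion removes the node (and its subtree)
        if "cl:indexed" in aspects:
            return True
    return False                  # never reached an indexed ancestor
```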

Environment Variables

# Alfresco (Internal Service Account)
export ALFRESCO_URL=http://localhost:8080
export ALFRESCO_INTERNAL_USERNAME=admin
export ALFRESCO_INTERNAL_PASSWORD=admin

# hxpr Content Lake
export HXPR_URL=http://localhost:8080
export HXPR_REPOSITORY_ID=default
export HXPR_IDP_TOKEN_URL=http://localhost:5002/idp/connect/token
export HXPR_IDP_CLIENT_ID=nuxeo-client
export HXPR_IDP_CLIENT_SECRET=secret
export HXPR_IDP_USERNAME=testuser
export HXPR_IDP_PASSWORD=password

# Transform Service (batch-ingester only)
export TRANSFORM_URL=http://localhost:10090
export TRANSFORM_ENABLED=true

# ActiveMQ / Event2 (live-ingester only)
export ACTIVEMQ_URL=tcp://localhost:61616
export ACTIVEMQ_USER=admin
export ACTIVEMQ_PASSWORD=admin
export ALFRESCO_EVENT_TOPIC=alfresco.repo.event2

# Nuxeo (Nuxeo ingesters + rag-service authority lookup)
export NUXEO_URL=http://localhost:8081/nuxeo
export NUXEO_USERNAME=Administrator
export NUXEO_PASSWORD=Administrator
export NUXEO_SOURCE_ID=local

# AI/Embeddings (both services)
# Spring AI appends /v1 itself; use the Docker Model Runner root URL.
export MODEL_RUNNER_URL=http://localhost:12434
export EMBEDDING_MODEL=ai/mxbai-embed-large

# LLM (rag-service only)
export LLM_MODEL=ai/gpt-oss
export LLM_TEMPERATURE=0.3
export LLM_MAX_TOKENS=1024

# RAG defaults (rag-service only)
export RAG_DEFAULT_TOP_K=5
export RAG_DEFAULT_MIN_SCORE=0.5
export RAG_MAX_CONTEXT_LENGTH=12000

# Performance (batch-ingester only)
export TRANSFORM_WORKERS=4
export EMBEDDING_CHUNK_SIZE=900
export EMBEDDING_CHUNK_OVERLAP=120
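
The chunk size and overlap settings behave like a sliding window over extracted text. A character-based sketch (the actual splitter may count tokens and respect sentence boundaries):

```python
def chunk(text: str, size: int = 900, overlap: int = 120) -> list[str]:
    """Split text into windows of `size` characters, with `overlap`
    characters shared between consecutive chunks, mirroring
    EMBEDDING_CHUNK_SIZE / EMBEDDING_CHUNK_OVERLAP."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    if not text:
        return []
    step = size - overlap  # advance 780 chars per chunk with the defaults
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With the defaults, a 2000-character document yields three chunks whose boundaries share 120 characters, so a sentence cut at one boundary still appears whole in the neighboring chunk.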

Nuxeo Backfill And Live Sync

compose.nuxeo.yaml starts both the nuxeo-batch-ingester and the nuxeo-live-ingester. The Nuxeo server itself is provided by the nuxeo-deployment companion project, which must be running before you start this stack.

Step 1 — start Nuxeo (in a separate terminal, from the sibling nuxeo-deployment/ directory):

git clone https://github.com/aborroy/nuxeo-deployment.git ../nuxeo-deployment
cd ../nuxeo-deployment
docker compose up --build

Nuxeo will be available at http://localhost:8081/nuxeo once healthy.

Step 2 — start the Nuxeo ingesters (from this directory):

docker compose -f compose.nuxeo.yaml up --build

This starts:

  • nuxeo-batch-ingester on http://localhost:9093 for one-shot backfills
  • nuxeo-live-ingester on http://localhost:9094 for audit-driven incremental sync

Defaults:

  • Nuxeo credentials: Administrator / Administrator
  • Discovery mode: NXQL
  • Included roots: /default-domain/workspaces
  • Included types: File, Note

Trigger a full configured backfill:

curl -X POST http://localhost:9093/api/sync/configured \
  -u Administrator:Administrator

Trigger a custom backfill with request overrides:

curl -X POST http://localhost:9093/api/sync/batch \
  -u Administrator:Administrator \
  -H "Content-Type: application/json" \
  -d '{
    "includedRoots": ["/default-domain/workspaces"],
    "includedDocumentTypes": ["File", "Note"],
    "excludedLifecycleStates": ["deleted"],
    "pageSize": 50,
    "discoveryMode": "NXQL"
  }'

Check status:

curl http://localhost:9093/api/sync/status -u Administrator:Administrator
curl http://localhost:9093/api/sync/status/{jobId} -u Administrator:Administrator

The live listener has no manual sync API. Use the actuator endpoints for health and metrics:

curl http://localhost:9094/actuator/health
curl http://localhost:9094/actuator/metrics

When using the deployment repo's reverse proxy, the public sync API remains /api/sync/*. Add ?sourceType=nuxeo to route to the Nuxeo ingester; omit the parameter, or set sourceType=alfresco, to reach the existing Alfresco ingester.

Authentication

REST API authentication is source-specific:

  • Alfresco ingesters validate incoming credentials or tickets against Alfresco.
  • nuxeo-batch-ingester uses HTTP Basic auth with the configured Nuxeo service credentials.
  • nuxeo-live-ingester does not expose sync APIs; health and metrics come from Spring Actuator.

Supported Methods

Method           Example
Basic Auth       curl -u admin:password http://localhost:9090/api/sync/status
Ticket (query)   curl "http://localhost:9090/api/sync/status?alf_ticket=TICKET_xxx"
Ticket (header)  curl -H "Authorization: Basic BASE64(TICKET_xxx)" ...

Note: Bearer token authentication (OAuth2/OIDC with Keycloak) is not yet supported.

Source-Native ACL Filtering

Current mixed-source filtering keeps Alfresco and Nuxeo principals source-native:

  • Ingested ACLs are written to hxpr with the source instance suffix _#_<sourceId>.
  • Alfresco and Nuxeo principals are not normalized to a shared identity yet.
  • rag-service expands Alfresco groups from Alfresco and Nuxeo groups from Nuxeo, then applies them only to matching source IDs.
  • Alfresco repository admins keep repository-admin discoverability for Alfresco sources without storing synthetic admin ACEs in sys_acl.
  • This mode assumes the authenticated username is the same login string in each source you want to query.
  • Nuxeo group expansion in rag-service uses the configured NUXEO_USERNAME and NUXEO_PASSWORD service credentials to read /api/v1/user/{username}.
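
The source-native model above can be sketched as suffixing on write and per-source expansion on query. Beyond the documented _#_<sourceId> suffix, all names here are illustrative:

```python
def suffix_principals(principals: list[str], source_id: str) -> list[str]:
    """Write ACEs to hxpr with the source instance suffix _#_<sourceId>."""
    return [f"{p}_#_{source_id}" for p in principals]

def query_authorities(username: str,
                      alfresco_groups: list[str], nuxeo_groups: list[str],
                      alfresco_source: str, nuxeo_source: str) -> set[str]:
    """Build the authority set for a search: each source's groups apply only
    to principals carrying that source's suffix, so an Alfresco group never
    grants access to a Nuxeo-ingested chunk (and vice versa)."""
    authorities: set[str] = set()
    for p in [username, *alfresco_groups]:
        authorities.add(f"{p}_#_{alfresco_source}")
    for p in [username, *nuxeo_groups]:
        authorities.add(f"{p}_#_{nuxeo_source}")
    return authorities
```

Note how the same username is suffixed once per source, which is why this mode assumes identical login strings across the sources being queried.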

Quick Example

# Authenticate and start sync
curl -X POST http://localhost:9090/api/sync/configured \
  -u admin:admin

# Or use Alfresco ticket
TICKET=$(curl -X POST http://localhost:8080/alfresco/api/-default-/public/authentication/versions/1/tickets \
  -H "Content-Type: application/json" \
  -d '{"userId":"admin","password":"admin"}' | jq -r '.entry.id')

curl -X POST "http://localhost:9090/api/sync/configured?alf_ticket=$TICKET"

API Usage

Batch Ingester (port 9090)

Start Synchronization

# Sync configured folders
curl -X POST http://localhost:9090/api/sync/configured -u admin:admin

# Sync specific folder
curl -X POST http://localhost:9090/api/sync/batch \
  -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{"folders": ["node-id"], "recursive": true, "types": ["cm:content"]}'

Monitor Progress

# Overall status
curl http://localhost:9090/api/sync/status -u admin:admin

# Job-specific status
curl http://localhost:9090/api/sync/status/{jobId} -u admin:admin

Reconcile Alfresco Permissions

Use this after an Alfresco permission change when you want to force reconciliation manually. It updates hxpr ACLs without re-running text extraction or embeddings.

# Reconcile a single file ACL
curl -X POST http://localhost:9090/api/sync/permissions \
  -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{"nodeIds":["file-node-id"],"recursive":true}'

# Reconcile a folder ACL across its descendant files
curl -X POST http://localhost:9090/api/sync/permissions \
  -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{"nodeIds":["folder-node-id"],"recursive":true}'

Query Node Status

# Single node
curl http://localhost:9090/api/content-lake/nodes/{nodeId}/status -u admin:admin

# Bulk node list
curl -X POST http://localhost:9090/api/content-lake/nodes/status \
  -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{"nodeIds":["node-id-1","node-id-2"]}'

# Optional: include aggregated subtree status for folders
curl -X POST http://localhost:9090/api/content-lake/nodes/status \
  -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{"nodeIds":["folder-id"],"includeFolderAggregate":true}'

# Optional: same aggregation for single-folder lookup
curl "http://localhost:9090/api/content-lake/nodes/{folderId}/status?includeFolderAggregate=true" \
  -u admin:admin

RAG Service (port 9091)

RAG Prompt

Ask a question and get an LLM-generated answer grounded in your indexed Alfresco and Nuxeo documents:

curl -X POST http://localhost:9091/api/rag/prompt -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{ "question": "What are the key findings in the Q4 report?" }'

With options:

curl -X POST http://localhost:9091/api/rag/prompt -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Summarize the budget proposal",
    "sourceType": "nuxeo",
    "topK": 10,
    "minScore": 0.6,
    "includeContext": true
  }'

Multi-turn conversation (same sessionId):

# Turn 1
curl -X POST http://localhost:9091/api/rag/prompt -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{
    "sessionId": "demo-session-1",
    "question": "Summarize the Q4 report highlights"
  }'

# Turn 2 (follow-up resolved with history)
curl -X POST http://localhost:9091/api/rag/prompt -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{
    "sessionId": "demo-session-1",
    "question": "Can you expand on the second point?"
  }'

Response:

{
  "answer": "The Q4 report highlights a 12% revenue increase...",
  "question": "What are the key findings in the Q4 report?",
  "sessionId": "demo-session-1",
  "retrievalQuery": "what are the key findings in the q4 report",
  "historyTurnsUsed": 2,
  "model": "ai/gpt-oss",
  "tokenCount": 672,
  "searchTimeMs": 245,
  "generationTimeMs": 1830,
  "totalTimeMs": 2075,
  "sourcesUsed": 3,
  "sources": [
    {
      "documentId": "abc-123",
      "sourceId": "nuxeo:nuxeo-demo",
      "sourceType": "nuxeo",
      "nodeId": "e4f5a6b7-...",
      "name": "Q4-Financial-Report.pdf",
      "path": "/default-domain/workspaces/finance",
      "openInSourceUrl": "http://localhost:8081/nuxeo/ui/#!/browse/default-domain/workspaces/finance/Q4-Financial-Report.pdf",
      "chunkText": "Revenue for Q4 increased by 12%...",
      "score": 0.87
    }
  ]
}

Request fields:

Field           Type     Default              Description
question        String   required             Natural-language question
sessionId       String   user-scoped default  Conversation session id for multi-turn context
resetSession    boolean  false                Clear conversation history for the target session before this prompt
topK            int      5                    Number of chunks to retrieve for context
minScore        double   0.5                  Minimum similarity threshold
filter          String   -                    Additional HXQL filter
sourceType      String   -                    Optional source filter: alfresco or nuxeo
systemPrompt    String   -                    Override the default LLM system prompt
includeContext  boolean  false                Include retrieved chunks in response

Response fields (selected):

Response Field             Type     Description
sessionId                  String   Effective session id used by the server
retrievalQuery             String   Query actually sent to retrieval (may be reformulated)
historyTurnsUsed           Integer  Number of prior turns included in this generation
tokenCount                 Integer  Total token usage (prompt + completion) when the provider reports it
sources[].sourceType       String   Source type for each cited document
sources[].openInSourceUrl  String   Native-source deep link (Share for Alfresco, Web UI for Nuxeo)

Chat Stream (SSE)

Streaming responses are available with Server-Sent Events (SSE).

  • Canonical endpoint: GET /api/rag/chat/stream
  • Backward-compatible endpoint: POST /api/rag/chat/stream (same JSON body as /api/rag/prompt)
  • Content type: text/event-stream
  • Authentication: same as other /api/rag/** endpoints (Basic Auth or Alfresco ticket)

GET example:

curl -N -G http://localhost:9091/api/rag/chat/stream -u admin:admin \
  --data-urlencode "question=What changed in Q4?" \
  --data-urlencode "sessionId=demo-session-1" \
  --data-urlencode "resetSession=false" \
  --data-urlencode "topK=5" \
  --data-urlencode "minScore=0.5"

Compatibility POST example:

curl -N -X POST http://localhost:9091/api/rag/chat/stream -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What changed in Q4?",
    "sessionId": "demo-session-1",
    "topK": 5,
    "minScore": 0.5
  }'

Query params for GET:

Field           Type     Default              Description
question        String   required             Natural-language question
sessionId       String   user-scoped default  Conversation session id for multi-turn context
resetSession    boolean  false                Clear conversation history before this prompt
topK            int      5                    Number of chunks to retrieve for context
minScore        double   0.5                  Minimum similarity threshold
filter          String   -                    Additional HXQL filter
sourceType      String   -                    Optional source filter: alfresco or nuxeo
embeddingType   String   model default        Embedding type to match
systemPrompt    String   -                    Override the default LLM system prompt
includeContext  boolean  false                Include retrieved chunks in final metadata

SSE events:

  • event: token incremental token payload ({"token":"..."})
  • event: metadata final payload with RagPromptResponse fields including sources, timing fields, model, and tokenCount
  • event: done terminal success event
  • event: error terminal failure event with error message
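
This event framing can be consumed with a minimal parser. A sketch that assumes one event: line followed by one data: line per event, as the examples below show (production clients should use a proper SSE library, which also handles multi-line data and reconnection):

```python
import json

def parse_sse(raw: str) -> list[tuple[str, dict]]:
    """Parse a text/event-stream payload into (event_name, data) pairs."""
    events: list[tuple[str, dict]] = []
    name = None
    for line in raw.splitlines():
        if line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:") and name is not None:
            events.append((name, json.loads(line[len("data:"):].strip())))
            name = None  # blank line separators need no special handling here
    return events

stream = 'event: token\ndata: {"token":"Revenue "}\n\nevent: done\ndata: {"status":"ok"}\n'
```

A client would concatenate token payloads as they arrive and treat metadata, done, and error as terminal bookkeeping events.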

Example stream:

event: token
data: {"token":"Revenue "}

event: token
data: {"token":"grew 12% in Q4."}

event: metadata
data: {"answer":"Revenue grew 12% in Q4.","question":"What changed in Q4?","model":"ai/gpt-oss","tokenCount":672,"searchTimeMs":245,"generationTimeMs":1830,"totalTimeMs":2075,"sourcesUsed":3,"sources":[{"documentId":"abc-123","sourceId":"nuxeo:nuxeo-demo","sourceType":"nuxeo","nodeId":"e4f5a6b7-...","name":"Q4-Financial-Report.pdf","path":"/default-domain/workspaces/finance","openInSourceUrl":"http://localhost:8081/nuxeo/ui/#!/browse/default-domain/workspaces/finance/Q4-Financial-Report.pdf","chunkText":"Revenue for Q4 increased by 12%...","score":0.87}]}

event: done
data: {"status":"ok"}

Error stream example:

event: error
data: {"message":"Failed to prepare RAG stream: ..."}

Semantic Search

Search directly against the embedded chunks without LLM generation:

curl -X POST http://localhost:9091/api/rag/search/semantic -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{ "query": "contract renewal terms", "topK": 5, "minScore": 0.6 }'

Semantic search applies a minimum similarity score to suppress low-quality vector matches when no strong semantic relation exists.

Results can include both Alfresco and Nuxeo hits in the same response. Each hit now includes sourceType and openInSourceUrl so clients can label and open the native source system directly.

{
  "query": "contract renewal terms",
  "resultCount": 2,
  "results": [
    {
      "rank": 1,
      "score": 0.91,
      "chunkText": "The renewal clause starts on page 3...",
      "sourceDocument": {
        "documentId": "doc-alf-1",
        "sourceId": "alfresco:repo-main",
        "sourceType": "alfresco",
        "nodeId": "550e8400-e29b-41d4-a716-446655440000",
        "name": "Vendor Contract.pdf",
        "path": "/Company Home/Sites/legal/documentLibrary",
        "mimeType": "application/pdf",
        "openInSourceUrl": "http://localhost:80/share/page/document-details?nodeRef=workspace://SpacesStore/550e8400-e29b-41d4-a716-446655440000"
      }
    },
    {
      "rank": 2,
      "score": 0.88,
      "chunkText": "Renewal requires 30 days notice...",
      "sourceDocument": {
        "documentId": "doc-nux-1",
        "sourceId": "nuxeo:nuxeo-demo",
        "sourceType": "nuxeo",
        "nodeId": "660e8400-e29b-41d4-a716-446655440000",
        "name": "Supplier Agreement.docx",
        "path": "/default-domain/workspaces/legal",
        "mimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "openInSourceUrl": "http://localhost:8081/nuxeo/ui/#!/browse/default-domain/workspaces/legal/Supplier%20Agreement.docx"
      }
    }
  ]
}

The minScore threshold defaults to 0.5, is applied server-side after vector retrieval, and can be overridden per request.

Hybrid Search

Run vector + keyword retrieval and fuse results with rrf (default) or weighted scoring:

curl -X POST http://localhost:9091/api/rag/search/hybrid -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{
    "query": "budget approval process",
    "strategy": "rrf",
    "candidateCount": 20,
    "maxResults": 5,
    "metadata": {
      "mimeType": "application/pdf",
      "pathPrefix": "/Company Home/Sites/finance/documentLibrary",
      "modifiedAfter": "2026-01-01T00:00:00Z",
      "modifiedBefore": "2026-12-31T23:59:59Z",
      "properties": {
        "cm:title": "Budget"
      }
    }
  }'

Structured metadata filters are optional. You can still pass a raw HXQL filter for advanced cases. Use sourceType when you want to restrict the request to a single source system without writing raw HXQL.

Response example:

{
  "query": "budget approval process",
  "strategy": "rrf",
  "normalization": "max",
  "model": "ai/mxbai-embed-large",
  "resultCount": 2,
  "vectorCandidates": 20,
  "keywordCandidates": 18,
  "searchTimeMs": 143,
  "results": [
    {
      "rank": 1,
      "score": 0.0325,
      "chunkText": "The budget approval workflow starts with...",
      "sourceDocument": {
        "documentId": "doc-nux-1",
        "sourceId": "nuxeo:nuxeo-demo",
        "sourceType": "nuxeo",
        "nodeId": "660e8400-e29b-41d4-a716-446655440000",
        "name": "Budget Policy.pdf",
        "path": "/default-domain/workspaces/finance",
        "mimeType": "application/pdf",
        "openInSourceUrl": "http://localhost:8081/nuxeo/ui/#!/browse/default-domain/workspaces/finance/Budget%20Policy.pdf"
      },
      "vectorScore": 0.87,
      "keywordScore": 1.0,
      "vectorRank": 2,
      "keywordRank": 1
    }
  ]
}

Request fields:

Field                    Type                Default   Description
query                    String              required  Query for both vector and keyword legs
strategy                 String              rrf       Fusion strategy: rrf or weighted
normalization            String              max       Weighted score normalization: max or minmax
candidateCount           int                 20        Candidates retrieved from each leg before fusion
maxResults               int                 5         Final fused result limit
vectorWeight             double              0.7       Weight when strategy=weighted
textWeight               double              0.3       Weight when strategy=weighted
filter                   String              -         Additional raw HXQL filter
sourceType               String              -         Optional source filter: alfresco or nuxeo
metadata.mimeType        String              -         MIME type filter (for example application/pdf)
metadata.pathPrefix      String              -         Path prefix filter (starts-with match)
metadata.modifiedAfter   String              -         Inclusive lower bound for source_modifiedAt
metadata.modifiedBefore  String              -         Inclusive upper bound for source_modifiedAt
metadata.properties      Map<String,String>  -         Exact-match filters on cin_ingestProperties.<key>

Response fields:

Response Field           Type    Description
query                    String  Original query
strategy                 String  Effective fusion strategy used
normalization            String  Normalization mode used when strategy=weighted
model                    String  Embedding model used for vector search
resultCount              int     Number of fused results returned
vectorCandidates         int     Number of vector candidates retrieved
keywordCandidates        int     Number of keyword candidates retrieved
searchTimeMs             long    Total hybrid search execution time
results[].score          double  Fused score (RRF or weighted)
results[].vectorScore    Double  Raw vector score, if available
results[].keywordScore   Double  Raw keyword score, if available
results[].sourceDocument object  Source document metadata
results[].chunkMetadata  object  Chunk position/type metadata
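
Both fusion strategies are straightforward to sketch. With the common RRF constant k=60, a document at vector rank 2 and keyword rank 1 fuses to 1/62 + 1/61 ≈ 0.0325, consistent with the fused score in the response example above (a sketch; the service's exact constant and normalization details may differ):

```python
def rrf_fuse(vector_ranked: list[str], keyword_ranked: list[str], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over legs of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranked in (vector_ranked, keyword_ranked):
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def weighted_fuse(vector_scores: dict[str, float], keyword_scores: dict[str, float],
                  vector_weight: float = 0.7, text_weight: float = 0.3) -> list[tuple[str, float]]:
    """Weighted fusion with max normalization per leg (strategy=weighted,
    normalization=max; minmax would instead rescale each leg to [0, 1])."""
    def norm(scores: dict[str, float]) -> dict[str, float]:
        m = max(scores.values(), default=1.0) or 1.0
        return {d: s / m for d, s in scores.items()}
    v, t = norm(vector_scores), norm(keyword_scores)
    fused = {d: vector_weight * v.get(d, 0.0) + text_weight * t.get(d, 0.0)
             for d in set(v) | set(t)}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

RRF only needs ranks, which makes it robust when the two legs produce incomparable raw scores; the weighted strategy preserves score magnitudes at the cost of needing normalization.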
Integration Smoke Test (local hxpr)

Use this checklist to validate issue #14 end-to-end:

  1. Ensure at least one folder is ingested into hxpr via batch/live ingesters.
  2. Call hybrid search without metadata constraints and verify resultCount > 0.
  3. Call hybrid search with a restrictive metadata filter (for example mimeType: application/pdf) and confirm results narrow.
  4. Switch strategy to weighted and confirm response field strategy is weighted.
  5. Confirm Nuxeo hits expose openInSourceUrl values that open in Nuxeo Web UI.

Example smoke-test requests:

# Baseline
curl -X POST http://localhost:9091/api/rag/search/hybrid -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{"query":"budget approval process","strategy":"rrf","candidateCount":20,"maxResults":5}'

# Restrictive metadata
curl -X POST http://localhost:9091/api/rag/search/hybrid -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{"query":"budget approval process","strategy":"rrf","sourceType":"nuxeo","metadata":{"mimeType":"application/pdf"}}'

# Weighted strategy
curl -X POST http://localhost:9091/api/rag/search/hybrid -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{"query":"budget approval process","strategy":"weighted","normalization":"minmax","vectorWeight":0.7,"textWeight":0.3}'

Health Checks

# Batch ingester (no auth required)
curl http://localhost:9090/actuator/health

# Live ingester (no auth required)
curl http://localhost:9092/actuator/health

# RAG service (no auth required)
curl http://localhost:9091/actuator/health

# RAG service detailed health (auth required)
curl http://localhost:9091/api/rag/health -u admin:admin

Live Ingester (port 9092)

The live ingester consumes Alfresco Event2 messages from ActiveMQ using Alfresco Java SDK handler interfaces such as OnNodeUpdatedEventHandler and OnPermissionUpdatedEventHandler.

It reuses the same shared ingestion pipeline as the batch ingester:

  • Fetch the current node snapshot from Alfresco REST API
  • Apply scope and exclusion rules
  • Sync metadata to hxpr
  • Extract text with Transform Service
  • Chunk and embed with Spring AI
  • Update permissions or delete when nodes move out of scope

Permission reconciliation is separate from content updates:

  • Content and scope changes are handled through Event2 live ingestion.
  • Alfresco permission changes should be reconciled through POST /api/sync/permissions in alfresco-batch-ingester because the repository does not reliably emit permission update events.
  • In production, the content-lake-repo-model addon inside Alfresco Repository should detect ACL changes after commit and publish a persistent ActiveMQ queue message. alfresco-batch-ingester consumes that queue and runs the same ACL reconciliation path.
  • If a permission event is emitted, the live ingester can still process it, but that path is best-effort rather than the primary contract.

When the live ingester does receive a permission-related event, it distinguishes between file and folder targets:

  • File-level event: the ACL is updated only for that file (updatePermissions) — no content re-extraction or embedding regeneration.
  • Folder-level event: the live ingester walks the full descendant subtree and applies an ACL-only update to every indexed file beneath the folder. This covers three event types that can signal a folder ACL change: PERMISSION_UPDATED, PEER_ASSOC_CREATED, and PEER_ASSOC_DELETED. A fourth handler (FolderPermissionFallbackHandler) catches NODE_UPDATED events on folders where only the ACL changed (no structural diff), providing a safety net for sources that do not emit a dedicated permission event.

Folder-level propagation behaviour:

  • Descendant files with isInheritanceEnabled: false keep their locally-set ACL unchanged — the folder's new permissions are not pushed down to them.
  • Descendant files with inheritance enabled receive a recomputed ACL derived from the folder's current Alfresco permissions snapshot.
  • Files that fall outside scope after the change are deleted from hxpr rather than updated.
  • The propagation never re-ingests content; it is strictly an ACL patch.
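
The propagation rules above reduce to a per-descendant classification. A sketch (type and field names are illustrative, not the live ingester's API):

```python
from dataclasses import dataclass

@dataclass
class Descendant:
    node_id: str
    inheritance_enabled: bool  # Alfresco permission inheritance flag
    in_scope: bool             # scope after the folder change

def plan_acl_propagation(descendants: list[Descendant]) -> dict[str, list[str]]:
    """Classify descendant files after a folder ACL change. The plan is
    strictly ACL-level: no branch triggers content re-extraction."""
    plan: dict[str, list[str]] = {"recompute_acl": [], "keep_local_acl": [], "delete": []}
    for d in descendants:
        if not d.in_scope:
            plan["delete"].append(d.node_id)         # fell out of scope: remove from hxpr
        elif d.inheritance_enabled:
            plan["recompute_acl"].append(d.node_id)  # ACL-only patch from folder snapshot
        else:
            plan["keep_local_acl"].append(d.node_id) # locally set ACL wins, untouched
    return plan
```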

The live path is guarded by the same alfresco_modifiedAt staleness check used by batch ingestion, so batch and live runs can coexist safely.

Status endpoint:

curl http://localhost:9092/api/live/status

Configuration

Ingestion

Edit alfresco/alfresco-batch-ingester/src/main/resources/application.yml:

ingestion:
  sources:
    - folder: your-folder-node-id
      recursive: true
      types: [cm:content]
  exclude:
    paths: ["*/surf-config/*", "*/thumbnails/*"]
    aspects: [cm:workingcopy]

Live Ingestion

Edit alfresco/alfresco-live-ingester/src/main/resources/application.yml:

spring:
  activemq:
    broker-url: ${ACTIVEMQ_URL:tcp://localhost:61616}
    user: ${ACTIVEMQ_USER:admin}
    password: ${ACTIVEMQ_PASSWORD:admin}
  jms:
    cache:
      enabled: false

alfresco:
  events:
    topic-name: ${ALFRESCO_EVENT_TOPIC:alfresco.repo.event2}
    enable-handlers: true
    enable-spring-integration: false

live-ingester:
  filter:
    exclude-paths: ["*/surf-config/*", "*/thumbnails/*"]
    exclude-aspects: [cm:workingcopy]
  scope:
    include-paths: []
    required-aspects: []
  dedup:
    window: ${LIVE_INGESTER_DEDUP_WINDOW:PT2M}
    max-entries: ${LIVE_INGESTER_DEDUP_MAX_ENTRIES:10000}

Notes:

  • spring.jms.cache.enabled=false is required so the Alfresco Java SDK can use the native ActiveMQ connection factory.
  • By default, the live ingester behaves as an exclude-only listener. Set include-paths or required-aspects to narrow the scope.
  • Transform Service receives the original Alfresco filename when available, improving binary format detection during text extraction.
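The dedup settings above (window, max-entries) describe a time-windowed duplicate filter. A minimal sketch of that idea, under the assumption that events are keyed per node, might look like this; the real implementation may differ.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the dedup window: an event for the same key seen
// again within `window` is dropped as a duplicate. Mirrors the intent of
// the live-ingester.dedup settings, not the actual code.
public class DedupWindow {
    private final Duration window;
    private final int maxEntries;
    private final Map<String, Instant> seen = new ConcurrentHashMap<>();

    DedupWindow(Duration window, int maxEntries) {
        this.window = window;
        this.maxEntries = maxEntries;
    }

    // Returns true if the event should be processed, false if it is a duplicate.
    boolean accept(String eventKey, Instant now) {
        Instant last = seen.get(eventKey);
        if (last != null && Duration.between(last, now).compareTo(window) < 0) {
            return false; // seen within the window: duplicate
        }
        if (seen.size() >= maxEntries) {
            seen.clear(); // crude bound; a real store would evict oldest entries
        }
        seen.put(eventKey, now);
        return true;
    }
}
```

With the defaults above (PT2M, 10000 entries), a burst of Event2 messages for the same node within two minutes collapses to a single ingestion.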

RAG

Edit common/rag-service/src/main/resources/application.yml:

spring:
  ai:
    openai:
      chat:
        options:
          model: ${LLM_MODEL:ai/gpt-oss}
          temperature: ${LLM_TEMPERATURE:0.3}
          maxTokens: ${LLM_MAX_TOKENS:1024}

rag:
  default-top-k: 5
  default-min-score: 0.5
  max-context-length: 12000
  default-system-prompt: >
    You are a document assistant that answers questions based strictly on
    the provided context.

    RULES:
    1. Use ONLY information from the DOCUMENT CONTEXT below. Do not use prior knowledge.
    2. When referencing information, cite the source using its label (e.g. "According to Source 1...").
    3. If multiple sources contain relevant information, synthesize them and cite each.
    4. If the context does not contain enough information to fully answer the question,
    clearly state what you can answer and what is missing.
    5. Be concise and direct. Do not repeat the question or add unnecessary preamble.
  conversation:
    enabled: true
    max-history-turns: 10
    session-ttl-minutes: 30
    query-reformulation: true

semantic-search:
  default-min-score: 0.5

search:
  hybrid:
    enabled: true
    strategy: rrf       # or weighted
    normalization: max  # max or minmax (weighted strategy)
    vector-weight: 0.7
    text-weight: 0.3
    initial-candidates: 20
    final-results: 5
    rrf-k: 60
    default-min-score: 0.0
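The rrf strategy above is Reciprocal Rank Fusion: each document scores 1 / (k + rank) in every result list it appears in, with k being the rrf-k setting (60 by default). A generic illustration of the scoring, not the rag-service implementation itself:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of Reciprocal Rank Fusion over two ranked lists (e.g. vector kNN
// hits and full-text hits). Higher fused score = better; documents found by
// both retrievers accumulate contributions from each list.
public class RrfFusion {

    static LinkedHashMap<String, Double> fuse(List<List<String>> rankedLists, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> list : rankedLists) {
            for (int rank = 0; rank < list.size(); rank++) {
                // rank is 0-based here, so the contribution is 1 / (k + rank + 1)
                scores.merge(list.get(rank), 1.0 / (k + rank + 1), Double::sum);
            }
        }
        LinkedHashMap<String, Double> fused = new LinkedHashMap<>();
        scores.entrySet().stream()
              .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
              .forEach(e -> fused.put(e.getKey(), e.getValue()));
        return fused;
    }

    public static void main(String[] args) {
        List<String> vectorHits = List.of("doc-a", "doc-b", "doc-c");
        List<String> textHits   = List.of("doc-b", "doc-a", "doc-d");
        System.out.println(fuse(List.of(vectorHits, textHits), 60));
    }
}
```

Unlike the weighted strategy, RRF needs no score normalization, since it uses only ranks; that is why the normalization setting applies to the weighted strategy only.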

Conversation memory storage:

  • Default implementation is in-memory.
  • To use Redis or a database, provide a custom Spring bean implementing ConversationMemoryStore; the default in-memory store is only created when no other ConversationMemoryStore bean exists.
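A custom store could look roughly like the following. The actual ConversationMemoryStore interface lives in common/rag-service and its method names may differ; this map-backed sketch only illustrates the shape of such a bean (in a Spring app you would annotate it with @Bean or @Component so it replaces the default).

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical map-backed conversation memory store. Method names
// (append/history/clear) are assumptions, not the real interface.
public class MapConversationMemoryStore {

    record Turn(String role, String content) {}

    private final Map<String, List<Turn>> sessions = new ConcurrentHashMap<>();

    void append(String sessionId, Turn turn) {
        sessions.computeIfAbsent(sessionId, id -> new CopyOnWriteArrayList<>()).add(turn);
    }

    List<Turn> history(String sessionId) {
        return sessions.getOrDefault(sessionId, List.of());
    }

    void clear(String sessionId) {
        sessions.remove(sessionId);
    }
}
```

A Redis- or JDBC-backed variant would keep the same surface and simply persist the per-session turn lists externally, so session history survives restarts.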

Roadmap

Next (Q2 2026 - Open Source Release)

  • Harden live-ingester with end-to-end Event2 coverage and operational guidance
  • OAuth2/Keycloak integration
  • Comprehensive testing suite
  • Production deployment guide

Future

  • Conversation history / multi-turn chat sessions
  • Re-ranking with cross-encoder models
  • Multiple embedding models per document
  • Document versioning support
  • DocFilters integration (better text extraction)
  • Multilingual embeddings
  • Performance optimizations for 10K+ documents

Development

Build

mvn clean package

Run Tests

mvn test

Run Locally

# Alfresco Batch Ingester
mvn spring-boot:run -pl alfresco/alfresco-batch-ingester -am
# or
java -jar alfresco/alfresco-batch-ingester/target/alfresco-batch-ingester-1.0.0-SNAPSHOT.jar

# Alfresco Live Ingester
mvn spring-boot:run -pl alfresco/alfresco-live-ingester -am
# or
java -jar alfresco/alfresco-live-ingester/target/alfresco-live-ingester-1.0.0-SNAPSHOT.jar

# Nuxeo Batch Ingester
mvn spring-boot:run -pl nuxeo/nuxeo-batch-ingester -am
# or
java -jar nuxeo/nuxeo-batch-ingester/target/nuxeo-batch-ingester-1.0.0-SNAPSHOT.jar

# Nuxeo Live Ingester
mvn spring-boot:run -pl nuxeo/nuxeo-live-ingester -am
# or
java -jar nuxeo/nuxeo-live-ingester/target/nuxeo-live-ingester-1.0.0-SNAPSHOT.jar

# RAG Service
mvn spring-boot:run -pl common/rag-service -am
# or
java -jar common/rag-service/target/rag-service-1.0.0-SNAPSHOT.jar

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'feat: add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Acknowledgments
