
Content Lake App


AI-powered semantic search and RAG over Alfresco and Nuxeo content using hxpr

Features · Quick Start · Architecture · Authentication · API Usage · Configuration

Related Projects

  • alfresco-content-lake-ui - ACA-based frontend for semantic search and RAG over Content Lake.
  • content-lake-app-ui - Demo UI for the Content Lake App that provides dual authentication (Alfresco + Nuxeo)
  • content-lake-app-deployment - Docker Compose deployment for Alfresco, Nuxeo, hxpr, Content Lake services, and the UI.
  • nuxeo-deployment - Companion project that builds and runs the local Nuxeo server. Required when using compose.nuxeo.yaml in this repo.

Documentation

Doc                   Contents
docs/architecture.md  Module layout, SPI interfaces, dependency graph, data model, design decisions
docs/sync-pipeline.md Full/live sync flows, metadata-only path, path structure, idempotency, scope resolution

Overview

Proof of Concept for AI-powered semantic search and Retrieval-Augmented Generation (RAG) over Alfresco and Nuxeo content.

Leverages hxpr as a Content Lake to enable high-quality AI search while:

  • Keeping Alfresco and Nuxeo as the sources of truth
  • Enforcing server-side permissions via ACLs
  • Supporting on-premises AI execution
  • Minimizing data duplication

Features

  • Two-Phase Sync Pipeline: Fast metadata ingestion + async content processing
  • Near Real-Time Sync: Alfresco Event2 listener over ActiveMQ using the Alfresco Java SDK
  • Semantic Search: Vector embeddings with permission-aware kNN search
  • RAG: LLM-powered question answering grounded in Alfresco document content
  • Permission-Aware: Server-side ACL enforcement via hxpr
  • Local AI: On-premises LLM and embedding models using Spring AI
  • Repository Scope Model: cl:indexed and cl:excludeFromLake for Alfresco-native scope control
  • REST API: Generic connector using Alfresco REST APIs
  • Secured Endpoints: Alfresco authentication (username/password or tickets)
  • Shared Ingestion Core: Common metadata, transform, chunking, embedding, ACL, and delete/update logic in content-lake-core
  • Idempotent Coexistence: alfresco_modifiedAt guard prevents stale batch/live writes from overwriting newer content

Architecture

  ┌────────────────────────────────────┐   ┌────────────────────────────────────┐
  │ Alfresco Repository + Event2       │   │ Nuxeo + Audit Stream               │
  │ REST API + ActiveMQ topic          │   │ REST API + audit log watermark     │
  └────────────────────────────────────┘   └────────────────────────────────────┘
         │                   │                    │                   │
         ▼                   ▼                    ▼                   ▼
  ┌────────────┐    ┌──────────────┐    ┌──────────────────┐  ┌──────────────────┐
  │ alfresco-  │    │ alfresco-    │    │ nuxeo-batch-     │  │ nuxeo-live-      │
  │ batch-     │    │ live-        │    │ ingester         │  │ ingester         │
  │ ingester   │    │ ingester     │    │ NXQL Discovery   │  │ Audit Watermark  │
  └────────────┘    └──────────────┘    └──────────────────┘  └──────────────────┘
         │                   │                    │                   │
         └───────────────────┴────────────────────┴───────────────────┘
                                          ▼
                    ┌──────────────────────────────────────────┐
                    │ content-lake-core                        │
                    │ Node sync, Transform, Chunk, Embed, ACL  │
                    │ source_modifiedAt idempotency guard      │
                    └──────────────────────────────────────────┘
                                          ▼
                    ┌──────────────────────────────────────────┐
                    │ hxpr Content Lake                        │
                    └──────────────────────────────────────────┘
                                          ▼
                    ┌──────────────────────────────────────────┐
                    │ rag-service                              │
                    │ Query → Embed → Search → Augment → LLM   │
                    └──────────────────────────────────────────┘

Modules

Module                        Group      Port   Description
content-lake-repo-model       common/    -      Alfresco repository JAR that bootstraps the cl:indexed content model for scope control
content-lake-spi              common/    -      Source Provider Interface: SourceNode, ContentSourceClient, TextExtractor, ScopeResolver
content-lake-core             common/    -      Shared ingestion pipeline: metadata sync, transform, chunking, embedding, ACL updates, idempotency
rag-service                   common/    9091   Semantic search, hybrid search, and RAG question answering
content-lake-source-alfresco  alfresco/  -      Alfresco REST clients, scope resolver, and ACL expansion
alfresco-batch-ingester       alfresco/  9090   Alfresco folder discovery, batch scheduling, and /api/sync/* controllers
alfresco-live-ingester        alfresco/  9092   Alfresco Event2 listener over ActiveMQ using Alfresco Java SDK handlers
content-lake-source-nuxeo     nuxeo/     -      Nuxeo REST clients, scope resolver, auth abstraction, and text extraction
nuxeo-batch-ingester          nuxeo/     9093   Nuxeo full-batch discovery and one-shot sync using NXQL
nuxeo-live-ingester           nuxeo/     9094   Nuxeo audit-stream listener using a persisted watermark

Quick Start

Prerequisites

  • Java 21+ and Maven 3.9+
  • Docker and Docker Compose
  • Alfresco Content Services 25.x+
    • Alfresco Transform Service (for text extraction)
  • hxpr Content Lake (with OAuth2 IDP)
  • Docker Model Runner (for embeddings and LLM)

Installation

# Clone repository
git clone https://github.com/aborroy/content-lake-app.git
cd content-lake-app

# Build all modules
mvn clean package

# Deploy the repository content model to ACS before starting the ingesters
# Artifact:
#   common/content-lake-repo-model/target/content-lake-repo-model-1.0.0-SNAPSHOT.jar
# Deploy it to the Alfresco Repository classpath.

# Configure (see Environment Variables below)
export ALFRESCO_URL=http://localhost:8080
export ALFRESCO_INTERNAL_USERNAME=admin
export ALFRESCO_INTERNAL_PASSWORD=admin
# ... (see full configuration below)

# Run batch ingestion
java -jar alfresco/alfresco-batch-ingester/target/alfresco-batch-ingester-1.0.0-SNAPSHOT.jar

# Run live ingestion
java -jar alfresco/alfresco-live-ingester/target/alfresco-live-ingester-1.0.0-SNAPSHOT.jar

# Run RAG service
java -jar common/rag-service/target/rag-service-1.0.0-SNAPSHOT.jar

# Or with Docker Compose (full stack)
cd ../content-lake-app-deployment && docker compose up --build

Alfresco Repo Model

The batch and live ingesters now rely on an Alfresco content model for scope control:

  • cl:indexed marks a folder subtree as in scope for Content Lake ingestion
  • cl:excludeFromLake opts a file, or an entire folder subtree, out of ingestion even when an ancestor folder carries cl:indexed

Build artifact:

common/content-lake-repo-model/target/content-lake-repo-model-1.0.0-SNAPSHOT.jar

Deploy that JAR to the Alfresco Repository classpath before enabling ingestion. Typical options are:

  • include it in an ACS SDK modules/platform build
  • copy or mount it into an Alfresco Repository image under webapps/alfresco/WEB-INF/lib

Starting From A Non-Indexed Repository

If your Alfresco Repository does not yet use cl:indexed, the recommended startup sequence is:

  1. Build the project and deploy the repository model JAR to Alfresco Repository. After deployment, restart the repository so cl:indexed and cl:excludeFromLake are available.
  2. Start batch-ingester.
  3. Run a batch synchronization against the folder you want to onboard. The ingester automatically adds cl:indexed to each root folder if it is not already present, then performs the initial backfill into Content Lake.
  4. Start live-ingester. Live ingestion then keeps that indexed subtree up to date.

Example for indexing all sites under Company Home/Sites:

  1. Resolve the Alfresco node id for Company Home/Sites. You can obtain it from Alfresco UI tools or the Alfresco REST API.
  2. Run the batch sync against that folder:
curl -X POST http://localhost:9090/api/sync/batch \
  -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{"folders":["SITES_FOLDER_NODE_ID"],"recursive":true,"types":["cm:content"]}'

This single call marks SITES_FOLDER_NODE_ID with cl:indexed (if needed) and ingests all existing content beneath it.

  3. After the batch completes, start live-ingester so new or changed content under Company Home/Sites continues to sync automatically.

Important:

  • cl:indexed can also be set directly via the Alfresco Repository nodes API or the Content Lake UI extension; the batch ingester sets it automatically only for root folders passed in the request
  • cl:excludeFromLake on a folder removes that folder's full subtree from Content Lake scope; batch discovery skips it and live reconciliation deletes previously ingested descendants
  • if you later want to index only one site, pass that site folder to /api/sync/batch instead of Company Home/Sites
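
The scope rules amount to an ancestor walk where the nearest marker wins. A sketch under that reading (the real logic lives behind the ScopeResolver SPI in content-lake-spi; names here are illustrative):

```python
def in_scope(path_aspects: list[set[str]]) -> bool:
    """Decide Content Lake scope for a node.

    path_aspects lists aspect sets from the node itself up to the root,
    e.g. [node, parent, grandparent, ...]. The node is in scope when the
    first marker found walking upward is cl:indexed; cl:excludeFromLake
    on the node or a nearer ancestor wins over an indexed ancestor.
    """
    for aspects in path_aspects:  # walk from node towards root
        if "cl:excludeFromLake" in aspects:
            return False          # exclusion removes the node (and its subtree)
        if "cl:indexed" in aspects:
            return True
    return False                  # never reached an indexed ancestor
```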

Environment Variables

# Alfresco (Internal Service Account)
export ALFRESCO_URL=http://localhost:8080
export ALFRESCO_INTERNAL_USERNAME=admin
export ALFRESCO_INTERNAL_PASSWORD=admin

# hxpr Content Lake
export HXPR_URL=http://localhost:8080
export HXPR_REPOSITORY_ID=default
export HXPR_IDP_TOKEN_URL=http://localhost:5002/idp/connect/token
export HXPR_IDP_CLIENT_ID=nuxeo-client
export HXPR_IDP_CLIENT_SECRET=secret
export HXPR_IDP_USERNAME=testuser
export HXPR_IDP_PASSWORD=password

# Transform Service (batch-ingester only)
export TRANSFORM_URL=http://localhost:10090
export TRANSFORM_ENABLED=true

# ActiveMQ / Event2 (live-ingester only)
export ACTIVEMQ_URL=tcp://localhost:61616
export ACTIVEMQ_USER=admin
export ACTIVEMQ_PASSWORD=admin
export ALFRESCO_EVENT_TOPIC=alfresco.repo.event2

# Nuxeo (Nuxeo ingesters + rag-service authority lookup)
export NUXEO_URL=http://localhost:8081/nuxeo
export NUXEO_USERNAME=Administrator
export NUXEO_PASSWORD=Administrator
export NUXEO_SOURCE_ID=local

# AI/Embeddings (both services)
# Spring AI appends /v1 itself; use the Docker Model Runner root URL.
export MODEL_RUNNER_URL=http://localhost:12434
export EMBEDDING_MODEL=ai/mxbai-embed-large

# LLM (rag-service only)
export LLM_MODEL=ai/gpt-oss
export LLM_TEMPERATURE=0.3
export LLM_MAX_TOKENS=1024

# RAG defaults (rag-service only)
export RAG_DEFAULT_TOP_K=5
export RAG_DEFAULT_MIN_SCORE=0.5
export RAG_MAX_CONTEXT_LENGTH=12000

# Performance (batch-ingester only)
export TRANSFORM_WORKERS=4
export EMBEDDING_CHUNK_SIZE=900
export EMBEDDING_CHUNK_OVERLAP=120
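
The chunk size and overlap settings behave like a sliding window over extracted text. A character-based sketch (the actual splitter may count tokens and respect sentence boundaries):

```python
def chunk(text: str, size: int = 900, overlap: int = 120) -> list[str]:
    """Split text into windows of `size` characters, with `overlap`
    characters shared between consecutive chunks, mirroring
    EMBEDDING_CHUNK_SIZE / EMBEDDING_CHUNK_OVERLAP."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    if not text:
        return []
    step = size - overlap  # advance 780 chars per chunk with the defaults
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With the defaults, a 2000-character document yields three chunks whose boundaries share 120 characters, so a sentence cut at one boundary still appears whole in the neighboring chunk.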

Nuxeo Backfill And Live Sync

compose.nuxeo.yaml starts both the nuxeo-batch-ingester and the nuxeo-live-ingester. The Nuxeo server itself is provided by the nuxeo-deployment companion project, which must be running before you start this stack.

Step 1 — start Nuxeo (in a separate terminal, from the sibling nuxeo-deployment/ directory):

git clone https://github.com/aborroy/nuxeo-deployment.git ../nuxeo-deployment
cd ../nuxeo-deployment
docker compose up --build

Nuxeo will be available at http://localhost:8081/nuxeo once healthy.

Step 2 — start the Nuxeo ingesters (from this directory):

docker compose -f compose.nuxeo.yaml up --build

This starts:

  • nuxeo-batch-ingester on http://localhost:9093 for one-shot backfills
  • nuxeo-live-ingester on http://localhost:9094 for audit-driven incremental sync

Defaults:

  • Nuxeo credentials: Administrator / Administrator
  • Discovery mode: NXQL
  • Included roots: /default-domain/workspaces
  • Included types: File, Note

Trigger a full configured backfill:

curl -X POST http://localhost:9093/api/sync/configured \
  -u Administrator:Administrator

Trigger a custom backfill with request overrides:

curl -X POST http://localhost:9093/api/sync/batch \
  -u Administrator:Administrator \
  -H "Content-Type: application/json" \
  -d '{
    "includedRoots": ["/default-domain/workspaces"],
    "includedDocumentTypes": ["File", "Note"],
    "excludedLifecycleStates": ["deleted"],
    "pageSize": 50,
    "discoveryMode": "NXQL"
  }'

Check status:

curl http://localhost:9093/api/sync/status -u Administrator:Administrator
curl http://localhost:9093/api/sync/status/{jobId} -u Administrator:Administrator

The live listener has no manual sync API. Use the actuator endpoints for health and metrics:

curl http://localhost:9094/actuator/health
curl http://localhost:9094/actuator/metrics

When using the deployment repo's reverse proxy, the public sync API remains /api/sync/*. Add ?sourceType=nuxeo to route to the Nuxeo ingester; omit the parameter, or set sourceType=alfresco, to reach the existing Alfresco ingester.

Authentication

REST API authentication is source-specific:

  • Alfresco ingesters validate incoming credentials or tickets against Alfresco.
  • nuxeo-batch-ingester uses HTTP Basic auth with the configured Nuxeo service credentials.
  • nuxeo-live-ingester does not expose sync APIs; health and metrics come from Spring Actuator.

Supported Methods

Method           Example
Basic Auth       curl -u admin:password http://localhost:9090/api/sync/status
Ticket (query)   curl "http://localhost:9090/api/sync/status?alf_ticket=TICKET_xxx"
Ticket (header)  curl -H "Authorization: Basic BASE64(TICKET_xxx)" ...

Note: Bearer token authentication (OAuth2/OIDC with Keycloak) is not yet supported.

Source-Native ACL Filtering

Current mixed-source filtering keeps Alfresco and Nuxeo principals source-native:

  • Ingested ACLs are written to hxpr with the source instance suffix _#_<sourceId>.
  • Alfresco and Nuxeo principals are not normalized to a shared identity yet.
  • rag-service expands Alfresco groups from Alfresco and Nuxeo groups from Nuxeo, then applies them only to matching source IDs.
  • Alfresco repository admins keep repository-admin discoverability for Alfresco sources without storing synthetic admin ACEs in sys_acl.
  • This mode assumes the authenticated username is the same login string in each source you want to query.
  • Nuxeo group expansion in rag-service uses the configured NUXEO_USERNAME and NUXEO_PASSWORD service credentials to read /api/v1/user/{username}.
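
The source-native model above can be sketched as suffixing on write and per-source expansion on query. Beyond the documented _#_<sourceId> suffix, all names here are illustrative:

```python
def suffix_principals(principals: list[str], source_id: str) -> list[str]:
    """Write ACEs to hxpr with the source instance suffix _#_<sourceId>."""
    return [f"{p}_#_{source_id}" for p in principals]

def query_authorities(username: str,
                      alfresco_groups: list[str], nuxeo_groups: list[str],
                      alfresco_source: str, nuxeo_source: str) -> set[str]:
    """Build the authority set for a search: each source's groups apply only
    to principals carrying that source's suffix, so an Alfresco group never
    grants access to a Nuxeo-ingested chunk (and vice versa)."""
    authorities: set[str] = set()
    for p in [username, *alfresco_groups]:
        authorities.add(f"{p}_#_{alfresco_source}")
    for p in [username, *nuxeo_groups]:
        authorities.add(f"{p}_#_{nuxeo_source}")
    return authorities
```

Note how the same username is suffixed once per source, which is why this mode assumes identical login strings across the sources being queried.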

Quick Example

# Authenticate and start sync
curl -X POST http://localhost:9090/api/sync/configured \
  -u admin:admin

# Or use Alfresco ticket
TICKET=$(curl -X POST http://localhost:8080/alfresco/api/-default-/public/authentication/versions/1/tickets \
  -H "Content-Type: application/json" \
  -d '{"userId":"admin","password":"admin"}' | jq -r '.entry.id')

curl -X POST "http://localhost:9090/api/sync/configured?alf_ticket=$TICKET"

API Usage

Batch Ingester (port 9090)

Start Synchronization

# Sync configured folders
curl -X POST http://localhost:9090/api/sync/configured -u admin:admin

# Sync specific folder
curl -X POST http://localhost:9090/api/sync/batch \
  -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{"folders": ["node-id"], "recursive": true, "types": ["cm:content"]}'

Monitor Progress

# Overall status
curl http://localhost:9090/api/sync/status -u admin:admin

# Job-specific status
curl http://localhost:9090/api/sync/status/{jobId} -u admin:admin

Reconcile Alfresco Permissions

Use this after an Alfresco permission change when you want to force reconciliation manually. It updates hxpr ACLs without re-running text extraction or embeddings.

# Reconcile a single file ACL
curl -X POST http://localhost:9090/api/sync/permissions \
  -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{"nodeIds":["file-node-id"],"recursive":true}'

# Reconcile a folder ACL across its descendant files
curl -X POST http://localhost:9090/api/sync/permissions \
  -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{"nodeIds":["folder-node-id"],"recursive":true}'

Query Node Status

# Single node
curl http://localhost:9090/api/content-lake/nodes/{nodeId}/status -u admin:admin

# Bulk node list
curl -X POST http://localhost:9090/api/content-lake/nodes/status \
  -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{"nodeIds":["node-id-1","node-id-2"]}'

# Optional: include aggregated subtree status for folders
curl -X POST http://localhost:9090/api/content-lake/nodes/status \
  -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{"nodeIds":["folder-id"],"includeFolderAggregate":true}'

# Optional: same aggregation for single-folder lookup
curl "http://localhost:9090/api/content-lake/nodes/{folderId}/status?includeFolderAggregate=true" \
  -u admin:admin

RAG Service (port 9091)

RAG Prompt

Ask a question and get an LLM-generated answer grounded in your indexed Alfresco and Nuxeo documents:

curl -X POST http://localhost:9091/api/rag/prompt -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{ "question": "What are the key findings in the Q4 report?" }'

With options:

curl -X POST http://localhost:9091/api/rag/prompt -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Summarize the budget proposal",
    "sourceType": "nuxeo",
    "topK": 10,
    "minScore": 0.6,
    "includeContext": true
  }'

Multi-turn conversation (same sessionId):

# Turn 1
curl -X POST http://localhost:9091/api/rag/prompt -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{
    "sessionId": "demo-session-1",
    "question": "Summarize the Q4 report highlights"
  }'

# Turn 2 (follow-up resolved with history)
curl -X POST http://localhost:9091/api/rag/prompt -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{
    "sessionId": "demo-session-1",
    "question": "Can you expand on the second point?"
  }'

Response:

{
  "answer": "The Q4 report highlights a 12% revenue increase...",
  "question": "What are the key findings in the Q4 report?",
  "sessionId": "demo-session-1",
  "retrievalQuery": "what are the key findings in the q4 report",
  "historyTurnsUsed": 2,
  "model": "ai/gpt-oss",
  "tokenCount": 672,
  "searchTimeMs": 245,
  "generationTimeMs": 1830,
  "totalTimeMs": 2075,
  "sourcesUsed": 3,
  "sources": [
    {
      "documentId": "abc-123",
      "sourceId": "nuxeo:nuxeo-demo",
      "sourceType": "nuxeo",
      "nodeId": "e4f5a6b7-...",
      "name": "Q4-Financial-Report.pdf",
      "path": "/default-domain/workspaces/finance",
      "openInSourceUrl": "http://localhost:8081/nuxeo/ui/#!/browse/default-domain/workspaces/finance/Q4-Financial-Report.pdf",
      "chunkText": "Revenue for Q4 increased by 12%...",
      "score": 0.87
    }
  ]
}

Request fields:

Field           Type     Default              Description
question        String   required             Natural-language question
sessionId       String   user-scoped default  Conversation session id for multi-turn context
resetSession    boolean  false                Clear conversation history for the target session before this prompt
topK            int      5                    Number of chunks to retrieve for context
minScore        double   0.5                  Minimum similarity threshold
filter          String   -                    Additional HXQL filter
sourceType      String   -                    Optional source filter: alfresco or nuxeo
systemPrompt    String   -                    Override the default LLM system prompt
includeContext  boolean  false                Include retrieved chunks in response

Response fields (selected):

Response Field             Type     Description
sessionId                  String   Effective session id used by the server
retrievalQuery             String   Query actually sent to retrieval (may be reformulated)
historyTurnsUsed           Integer  Number of prior turns included in this generation
tokenCount                 Integer  Total token usage (prompt + completion) when the provider reports it
sources[].sourceType       String   Source type for each cited document
sources[].openInSourceUrl  String   Native-source deep link (Share for Alfresco, Web UI for Nuxeo)

Chat Stream (SSE)

Streaming responses are available with Server-Sent Events (SSE).

  • Canonical endpoint: GET /api/rag/chat/stream
  • Backward-compatible endpoint: POST /api/rag/chat/stream (same JSON body as /api/rag/prompt)
  • Content type: text/event-stream
  • Authentication: same as other /api/rag/** endpoints (Basic Auth or Alfresco ticket)

GET example:

curl -N -G http://localhost:9091/api/rag/chat/stream -u admin:admin \
  --data-urlencode "question=What changed in Q4?" \
  --data-urlencode "sessionId=demo-session-1" \
  --data-urlencode "resetSession=false" \
  --data-urlencode "topK=5" \
  --data-urlencode "minScore=0.5"

Compatibility POST example:

curl -N -X POST http://localhost:9091/api/rag/chat/stream -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What changed in Q4?",
    "sessionId": "demo-session-1",
    "topK": 5,
    "minScore": 0.5
  }'

Query params for GET:

Field           Type     Default              Description
question        String   required             Natural-language question
sessionId       String   user-scoped default  Conversation session id for multi-turn context
resetSession    boolean  false                Clear conversation history before this prompt
topK            int      5                    Number of chunks to retrieve for context
minScore        double   0.5                  Minimum similarity threshold
filter          String   -                    Additional HXQL filter
sourceType      String   -                    Optional source filter: alfresco or nuxeo
embeddingType   String   model default        Embedding type to match
systemPrompt    String   -                    Override the default LLM system prompt
includeContext  boolean  false                Include retrieved chunks in final metadata

SSE events:

  • event: token incremental token payload ({"token":"..."})
  • event: metadata final payload with RagPromptResponse fields including sources, timing fields, model, and tokenCount
  • event: done terminal success event
  • event: error terminal failure event with error message
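
This event framing can be consumed with a minimal parser. A sketch that assumes one event: line followed by one data: line per event, as the examples below show (production clients should use a proper SSE library, which also handles multi-line data and reconnection):

```python
import json

def parse_sse(raw: str) -> list[tuple[str, dict]]:
    """Parse a text/event-stream payload into (event_name, data) pairs."""
    events: list[tuple[str, dict]] = []
    name = None
    for line in raw.splitlines():
        if line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:") and name is not None:
            events.append((name, json.loads(line[len("data:"):].strip())))
            name = None  # blank line separators need no special handling here
    return events

stream = 'event: token\ndata: {"token":"Revenue "}\n\nevent: done\ndata: {"status":"ok"}\n'
```

A client would concatenate token payloads as they arrive and treat metadata, done, and error as terminal bookkeeping events.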

Example stream:

event: token
data: {"token":"Revenue "}

event: token
data: {"token":"grew 12% in Q4."}

event: metadata
data: {"answer":"Revenue grew 12% in Q4.","question":"What changed in Q4?","model":"ai/gpt-oss","tokenCount":672,"searchTimeMs":245,"generationTimeMs":1830,"totalTimeMs":2075,"sourcesUsed":3,"sources":[{"documentId":"abc-123","sourceId":"nuxeo:nuxeo-demo","sourceType":"nuxeo","nodeId":"e4f5a6b7-...","name":"Q4-Financial-Report.pdf","path":"/default-domain/workspaces/finance","openInSourceUrl":"http://localhost:8081/nuxeo/ui/#!/browse/default-domain/workspaces/finance/Q4-Financial-Report.pdf","chunkText":"Revenue for Q4 increased by 12%...","score":0.87}]}

event: done
data: {"status":"ok"}

Error stream example:

event: error
data: {"message":"Failed to prepare RAG stream: ..."}

Semantic Search

Search directly against the embedded chunks without LLM generation:

curl -X POST http://localhost:9091/api/rag/search/semantic -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{ "query": "contract renewal terms", "topK": 5, "minScore": 0.6 }'

Semantic search applies a minimum similarity score to suppress low-quality vector matches when no strong semantic relation exists.

Results can include both Alfresco and Nuxeo hits in the same response. Each hit now includes sourceType and openInSourceUrl so clients can label and open the native source system directly.

{
  "query": "contract renewal terms",
  "resultCount": 2,
  "results": [
    {
      "rank": 1,
      "score": 0.91,
      "chunkText": "The renewal clause starts on page 3...",
      "sourceDocument": {
        "documentId": "doc-alf-1",
        "sourceId": "alfresco:repo-main",
        "sourceType": "alfresco",
        "nodeId": "550e8400-e29b-41d4-a716-446655440000",
        "name": "Vendor Contract.pdf",
        "path": "/Company Home/Sites/legal/documentLibrary",
        "mimeType": "application/pdf",
        "openInSourceUrl": "http://localhost:80/share/page/document-details?nodeRef=workspace://SpacesStore/550e8400-e29b-41d4-a716-446655440000"
      }
    },
    {
      "rank": 2,
      "score": 0.88,
      "chunkText": "Renewal requires 30 days notice...",
      "sourceDocument": {
        "documentId": "doc-nux-1",
        "sourceId": "nuxeo:nuxeo-demo",
        "sourceType": "nuxeo",
        "nodeId": "660e8400-e29b-41d4-a716-446655440000",
        "name": "Supplier Agreement.docx",
        "path": "/default-domain/workspaces/legal",
        "mimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "openInSourceUrl": "http://localhost:8081/nuxeo/ui/#!/browse/default-domain/workspaces/legal/Supplier%20Agreement.docx"
      }
    }
  ]
}

The minScore threshold defaults to 0.5, is applied server-side after vector retrieval, and can be overridden per request.

Hybrid Search

Run vector + keyword retrieval and fuse results with rrf (default) or weighted scoring:

curl -X POST http://localhost:9091/api/rag/search/hybrid -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{
    "query": "budget approval process",
    "strategy": "rrf",
    "candidateCount": 20,
    "maxResults": 5,
    "metadata": {
      "mimeType": "application/pdf",
      "pathPrefix": "/Company Home/Sites/finance/documentLibrary",
      "modifiedAfter": "2026-01-01T00:00:00Z",
      "modifiedBefore": "2026-12-31T23:59:59Z",
      "properties": {
        "cm:title": "Budget"
      }
    }
  }'

Structured metadata filters are optional. You can still pass a raw HXQL filter for advanced cases. Use sourceType when you want to restrict the request to a single source system without writing raw HXQL.

Response example:

{
  "query": "budget approval process",
  "strategy": "rrf",
  "normalization": "max",
  "model": "ai/mxbai-embed-large",
  "resultCount": 2,
  "vectorCandidates": 20,
  "keywordCandidates": 18,
  "searchTimeMs": 143,
  "results": [
    {
      "rank": 1,
      "score": 0.0325,
      "chunkText": "The budget approval workflow starts with...",
      "sourceDocument": {
        "documentId": "doc-nux-1",
        "sourceId": "nuxeo:nuxeo-demo",
        "sourceType": "nuxeo",
        "nodeId": "660e8400-e29b-41d4-a716-446655440000",
        "name": "Budget Policy.pdf",
        "path": "/default-domain/workspaces/finance",
        "mimeType": "application/pdf",
        "openInSourceUrl": "http://localhost:8081/nuxeo/ui/#!/browse/default-domain/workspaces/finance/Budget%20Policy.pdf"
      },
      "vectorScore": 0.87,
      "keywordScore": 1.0,
      "vectorRank": 2,
      "keywordRank": 1
    }
  ]
}

Request fields:

Field                    Type                Default   Description
query                    String              required  Query for both vector and keyword legs
strategy                 String              rrf       Fusion strategy: rrf or weighted
normalization            String              max       Weighted score normalization: max or minmax
candidateCount           int                 20        Candidates retrieved from each leg before fusion
maxResults               int                 5         Final fused result limit
vectorWeight             double              0.7       Weight when strategy=weighted
textWeight               double              0.3       Weight when strategy=weighted
filter                   String              -         Additional raw HXQL filter
sourceType               String              -         Optional source filter: alfresco or nuxeo
metadata.mimeType        String              -         MIME type filter (for example application/pdf)
metadata.pathPrefix      String              -         Path prefix filter (starts-with match)
metadata.modifiedAfter   String              -         Inclusive lower bound for source_modifiedAt
metadata.modifiedBefore  String              -         Inclusive upper bound for source_modifiedAt
metadata.properties      Map<String,String>  -         Exact-match filters on cin_ingestProperties.<key>

Response fields:

Response Field           Type    Description
query                    String  Original query
strategy                 String  Effective fusion strategy used
normalization            String  Normalization mode used when strategy=weighted
model                    String  Embedding model used for vector search
resultCount              int     Number of fused results returned
vectorCandidates         int     Number of vector candidates retrieved
keywordCandidates        int     Number of keyword candidates retrieved
searchTimeMs             long    Total hybrid search execution time
results[].score          double  Fused score (RRF or weighted)
results[].vectorScore    Double  Raw vector score, if available
results[].keywordScore   Double  Raw keyword score, if available
results[].sourceDocument object  Source document metadata
results[].chunkMetadata  object  Chunk position/type metadata
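
Both fusion strategies are straightforward to sketch. With the common RRF constant k=60, a document at vector rank 2 and keyword rank 1 fuses to 1/62 + 1/61 ≈ 0.0325, consistent with the fused score in the response example above (a sketch; the service's exact constant and normalization details may differ):

```python
def rrf_fuse(vector_ranked: list[str], keyword_ranked: list[str], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over legs of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranked in (vector_ranked, keyword_ranked):
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def weighted_fuse(vector_scores: dict[str, float], keyword_scores: dict[str, float],
                  vector_weight: float = 0.7, text_weight: float = 0.3) -> list[tuple[str, float]]:
    """Weighted fusion with max normalization per leg (strategy=weighted,
    normalization=max; minmax would instead rescale each leg to [0, 1])."""
    def norm(scores: dict[str, float]) -> dict[str, float]:
        m = max(scores.values(), default=1.0) or 1.0
        return {d: s / m for d, s in scores.items()}
    v, t = norm(vector_scores), norm(keyword_scores)
    fused = {d: vector_weight * v.get(d, 0.0) + text_weight * t.get(d, 0.0)
             for d in set(v) | set(t)}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

RRF only needs ranks, which makes it robust when the two legs produce incomparable raw scores; the weighted strategy preserves score magnitudes at the cost of needing normalization.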
Integration Smoke Test (local hxpr)

Use this checklist to validate issue #14 end-to-end:

  1. Ensure at least one folder is ingested into hxpr via batch/live ingesters.
  2. Call hybrid search without metadata constraints and verify resultCount > 0.
  3. Call hybrid search with a restrictive metadata filter (for example mimeType: application/pdf) and confirm results narrow.
  4. Switch strategy to weighted and confirm response field strategy is weighted.
  5. Confirm Nuxeo hits expose openInSourceUrl values that open in Nuxeo Web UI.

Example smoke-test requests:

# Baseline
curl -X POST http://localhost:9091/api/rag/search/hybrid -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{"query":"budget approval process","strategy":"rrf","candidateCount":20,"maxResults":5}'

# Restrictive metadata
curl -X POST http://localhost:9091/api/rag/search/hybrid -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{"query":"budget approval process","strategy":"rrf","sourceType":"nuxeo","metadata":{"mimeType":"application/pdf"}}'

# Weighted strategy
curl -X POST http://localhost:9091/api/rag/search/hybrid -u admin:admin \
  -H "Content-Type: application/json" \
  -d '{"query":"budget approval process","strategy":"weighted","normalization":"minmax","vectorWeight":0.7,"textWeight":0.3}'

Health Checks

# Batch ingester (no auth required)
curl http://localhost:9090/actuator/health

# Live ingester (no auth required)
curl http://localhost:9092/actuator/health

# RAG service (no auth required)
curl http://localhost:9091/actuator/health

# RAG service detailed health (auth required)
curl http://localhost:9091/api/rag/health -u admin:admin

Live Ingester (port 9092)

The live ingester consumes Alfresco Event2 messages from ActiveMQ using Alfresco Java SDK handler interfaces such as OnNodeUpdatedEventHandler and OnPermissionUpdatedEventHandler.

It reuses the same shared ingestion pipeline as the batch ingester:

  • Fetch the current node snapshot from Alfresco REST API
  • Apply scope and exclusion rules
  • Sync metadata to hxpr
  • Extract text with Transform Service
  • Chunk and embed with Spring AI
  • Update permissions or delete when nodes move out of scope

Permission reconciliation is separate from content updates:

  • Content and scope changes are handled through Event2 live ingestion.
  • Alfresco permission changes should be reconciled through POST /api/sync/permissions in alfresco-batch-ingester because the repository does not reliably emit permission update events.
  • In production, the content-lake-repo-model addon inside Alfresco Repository should detect ACL changes after commit and publish a persistent ActiveMQ queue message. alfresco-batch-ingester consumes that queue and runs the same ACL reconciliation path.
  • If a permission event is emitted, the live ingester can still process it, but that path is best-effort rather than the primary contract.

When the live ingester does receive a permission-related event, it distinguishes between file and folder targets:

  • File-level event: the ACL is updated only for that file (updatePermissions) — no content re-extraction or embedding regeneration.
  • Folder-level event: the live ingester walks the full descendant subtree and applies an ACL-only update to every indexed file beneath the folder. This covers three event types that can signal a folder ACL change: PERMISSION_UPDATED, PEER_ASSOC_CREATED, and PEER_ASSOC_DELETED. A fourth handler (FolderPermissionFallbackHandler) catches NODE_UPDATED events on folders where only the ACL changed (no structural diff), providing a safety net for sources that do not emit a dedicated permission event.

Folder-level propagation behaviour:

  • Descendant files with isInheritanceEnabled: false keep their locally-set ACL unchanged — the folder's new permissions are not pushed down to them.
  • Descendant files with inheritance enabled receive a recomputed ACL derived from the folder's current Alfresco permissions snapshot.
  • Files that fall outside scope after the change are deleted from hxpr rather than updated.
  • The propagation never re-ingests content; it is strictly an ACL patch.
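
The propagation rules above reduce to a per-descendant classification. A sketch (type and field names are illustrative, not the live ingester's API):

```python
from dataclasses import dataclass

@dataclass
class Descendant:
    node_id: str
    inheritance_enabled: bool  # Alfresco permission inheritance flag
    in_scope: bool             # scope after the folder change

def plan_acl_propagation(descendants: list[Descendant]) -> dict[str, list[str]]:
    """Classify descendant files after a folder ACL change. The plan is
    strictly ACL-level: no branch triggers content re-extraction."""
    plan: dict[str, list[str]] = {"recompute_acl": [], "keep_local_acl": [], "delete": []}
    for d in descendants:
        if not d.in_scope:
            plan["delete"].append(d.node_id)         # fell out of scope: remove from hxpr
        elif d.inheritance_enabled:
            plan["recompute_acl"].append(d.node_id)  # ACL-only patch from folder snapshot
        else:
            plan["keep_local_acl"].append(d.node_id) # locally set ACL wins, untouched
    return plan
```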

The live path is guarded by the same alfresco_modifiedAt staleness check used by batch ingestion, so batch and live runs can coexist safely.

Status endpoint:

curl http://localhost:9092/api/live/status

Configuration

Ingestion

Edit alfresco/alfresco-batch-ingester/src/main/resources/application.yml:

ingestion:
  sources:
    - folder: your-folder-node-id
      recursive: true
      types: [cm:content]
  exclude:
    paths: ["*/surf-config/*", "*/thumbnails/*"]
    aspects: [cm:workingcopy]

Live Ingestion

Edit alfresco/alfresco-live-ingester/src/main/resources/application.yml:

spring:
  activemq:
    broker-url: ${ACTIVEMQ_URL:tcp://localhost:61616}
    user: ${ACTIVEMQ_USER:admin}
    password: ${ACTIVEMQ_PASSWORD:admin}
  jms:
    cache:
      enabled: false

alfresco:
  events:
    topic-name: ${ALFRESCO_EVENT_TOPIC:alfresco.repo.event2}
    enable-handlers: true
    enable-spring-integration: false

live-ingester:
  filter:
    exclude-paths: ["*/surf-config/*", "*/thumbnails/*"]
    exclude-aspects: [cm:workingcopy]
  scope:
    include-paths: []
    required-aspects: []
  dedup:
    window: ${LIVE_INGESTER_DEDUP_WINDOW:PT2M}
    max-entries: ${LIVE_INGESTER_DEDUP_MAX_ENTRIES:10000}

Notes:

  • spring.jms.cache.enabled=false is required so the Alfresco Java SDK can use the native ActiveMQ connection factory.
  • By default, the live ingester behaves as an exclude-only listener. Set include-paths or required-aspects to narrow the scope.
  • Transform Service receives the original Alfresco filename when available, improving binary format detection during text extraction.
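The dedup settings above (window, max-entries) describe a time-windowed duplicate filter. A minimal sketch of that idea, under the assumption that events are keyed per node, might look like this; the real implementation may differ.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the dedup window: an event for the same key seen
// again within `window` is dropped as a duplicate. Mirrors the intent of
// the live-ingester.dedup settings, not the actual code.
public class DedupWindow {
    private final Duration window;
    private final int maxEntries;
    private final Map<String, Instant> seen = new ConcurrentHashMap<>();

    DedupWindow(Duration window, int maxEntries) {
        this.window = window;
        this.maxEntries = maxEntries;
    }

    // Returns true if the event should be processed, false if it is a duplicate.
    boolean accept(String eventKey, Instant now) {
        Instant last = seen.get(eventKey);
        if (last != null && Duration.between(last, now).compareTo(window) < 0) {
            return false; // seen within the window: duplicate
        }
        if (seen.size() >= maxEntries) {
            seen.clear(); // crude bound; a real store would evict oldest entries
        }
        seen.put(eventKey, now);
        return true;
    }
}
```

With the defaults above (PT2M, 10000 entries), a burst of Event2 messages for the same node within two minutes collapses to a single ingestion.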

RAG

Edit common/rag-service/src/main/resources/application.yml:

spring:
  ai:
    openai:
      chat:
        options:
          model: ${LLM_MODEL:ai/gpt-oss}
          temperature: ${LLM_TEMPERATURE:0.3}
          maxTokens: ${LLM_MAX_TOKENS:1024}

rag:
  default-top-k: 5
  default-min-score: 0.5
  max-context-length: 12000
  default-system-prompt: >
    You are a document assistant that answers questions based strictly on
    the provided context.

    RULES:
    1. Use ONLY information from the DOCUMENT CONTEXT below. Do not use prior knowledge.
    2. When referencing information, cite the source using its label (e.g. "According to Source 1...").
    3. If multiple sources contain relevant information, synthesize them and cite each.
    4. If the context does not contain enough information to fully answer the question,
    clearly state what you can answer and what is missing.
    5. Be concise and direct. Do not repeat the question or add unnecessary preamble.
  conversation:
    enabled: true
    max-history-turns: 10
    session-ttl-minutes: 30
    query-reformulation: true

semantic-search:
  default-min-score: 0.5

search:
  hybrid:
    enabled: true
    strategy: rrf       # or weighted
    normalization: max  # max or minmax (weighted strategy)
    vector-weight: 0.7
    text-weight: 0.3
    initial-candidates: 20
    final-results: 5
    rrf-k: 60
    default-min-score: 0.0
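The rrf strategy above is Reciprocal Rank Fusion: each document scores 1 / (k + rank) in every result list it appears in, with k being the rrf-k setting (60 by default). A generic illustration of the scoring, not the rag-service implementation itself:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of Reciprocal Rank Fusion over two ranked lists (e.g. vector kNN
// hits and full-text hits). Higher fused score = better; documents found by
// both retrievers accumulate contributions from each list.
public class RrfFusion {

    static LinkedHashMap<String, Double> fuse(List<List<String>> rankedLists, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> list : rankedLists) {
            for (int rank = 0; rank < list.size(); rank++) {
                // rank is 0-based here, so the contribution is 1 / (k + rank + 1)
                scores.merge(list.get(rank), 1.0 / (k + rank + 1), Double::sum);
            }
        }
        LinkedHashMap<String, Double> fused = new LinkedHashMap<>();
        scores.entrySet().stream()
              .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
              .forEach(e -> fused.put(e.getKey(), e.getValue()));
        return fused;
    }

    public static void main(String[] args) {
        List<String> vectorHits = List.of("doc-a", "doc-b", "doc-c");
        List<String> textHits   = List.of("doc-b", "doc-a", "doc-d");
        System.out.println(fuse(List.of(vectorHits, textHits), 60));
    }
}
```

Unlike the weighted strategy, RRF needs no score normalization, since it uses only ranks; that is why the normalization setting applies to the weighted strategy only.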

Conversation memory storage:

  • Default implementation is in-memory.
  • To use Redis or a database, provide a custom Spring bean implementing ConversationMemoryStore; the default in-memory store is only created when no other ConversationMemoryStore bean exists.
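A custom store could look roughly like the following. The actual ConversationMemoryStore interface lives in common/rag-service and its method names may differ; this map-backed sketch only illustrates the shape of such a bean (in a Spring app you would annotate it with @Bean or @Component so it replaces the default).

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical map-backed conversation memory store. Method names
// (append/history/clear) are assumptions, not the real interface.
public class MapConversationMemoryStore {

    record Turn(String role, String content) {}

    private final Map<String, List<Turn>> sessions = new ConcurrentHashMap<>();

    void append(String sessionId, Turn turn) {
        sessions.computeIfAbsent(sessionId, id -> new CopyOnWriteArrayList<>()).add(turn);
    }

    List<Turn> history(String sessionId) {
        return sessions.getOrDefault(sessionId, List.of());
    }

    void clear(String sessionId) {
        sessions.remove(sessionId);
    }
}
```

A Redis- or JDBC-backed variant would keep the same surface and simply persist the per-session turn lists externally, so session history survives restarts.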

Roadmap

Next (Q2 2026 - Open Source Release)

  • Harden live-ingester with end-to-end Event2 coverage and operational guidance
  • OAuth2/Keycloak integration
  • Comprehensive testing suite
  • Production deployment guide

Future

  • Conversation history / multi-turn chat sessions
  • Re-ranking with cross-encoder models
  • Multiple embedding models per document
  • Document versioning support
  • DocFilters integration (better text extraction)
  • Multilingual embeddings
  • Performance optimizations for 10K+ documents

Development

Build

mvn clean package

Run Tests

mvn test

Run Locally

# Alfresco Batch Ingester
mvn spring-boot:run -pl alfresco/alfresco-batch-ingester -am
# or
java -jar alfresco/alfresco-batch-ingester/target/alfresco-batch-ingester-1.0.0-SNAPSHOT.jar

# Alfresco Live Ingester
mvn spring-boot:run -pl alfresco/alfresco-live-ingester -am
# or
java -jar alfresco/alfresco-live-ingester/target/alfresco-live-ingester-1.0.0-SNAPSHOT.jar

# Nuxeo Batch Ingester
mvn spring-boot:run -pl nuxeo/nuxeo-batch-ingester -am
# or
java -jar nuxeo/nuxeo-batch-ingester/target/nuxeo-batch-ingester-1.0.0-SNAPSHOT.jar

# Nuxeo Live Ingester
mvn spring-boot:run -pl nuxeo/nuxeo-live-ingester -am
# or
java -jar nuxeo/nuxeo-live-ingester/target/nuxeo-live-ingester-1.0.0-SNAPSHOT.jar

# RAG Service
mvn spring-boot:run -pl common/rag-service -am
# or
java -jar common/rag-service/target/rag-service-1.0.0-SNAPSHOT.jar

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'feat: add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Acknowledgments
