Document processing is the bridge between an uploaded file and the agent's context window. You upload the file once via the standard data API, then reference it by storage path in the chat request. Inside the agent, DocumentProcessorService extracts text, attaches metadata, and feeds it into the prompt before the LLM call.
Endpoints
| Endpoint | Auth |
|---|---|
| /repository/create/upload | JWT |
| /ai/agent/stream | JWT |
| /ai/chat | JWT |

Supported formats
| Format | Notes |
|---|---|
| PDF | Extracted via pdf-parse. Page-aware: the metadata includes pageCount. |
| DOCX / DOC | Extracted via mammoth. Word count tracked in metadata. |
| PPTX / PPT | Slide-by-slide extraction. Metadata includes slideCount. |
Other formats (XLSX, CSV, TXT, MD) flow through as raw text without a special extractor.
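For context, here is a minimal sketch of that extension-based routing, assuming Node with the pdf-parse and mammoth packages. It is illustrative only; the real DocumentProcessorService internals may differ, and the slide-by-slide PPTX extractor is omitted.

```ts
import pdfParse from 'pdf-parse';
import mammoth from 'mammoth';

// Illustrative sketch only; not the actual DocumentProcessorService code.
async function extractText(buffer: Buffer, fileName: string) {
  const ext = fileName.split('.').pop()?.toLowerCase();
  if (ext === 'pdf') {
    const pdf = await pdfParse(buffer);
    return { text: pdf.text, metadata: { pageCount: pdf.numpages } };
  }
  if (ext === 'docx' || ext === 'doc') {
    const { value: text } = await mammoth.extractRawText({ buffer });
    return { text, metadata: { wordCount: text.split(/\s+/).filter(Boolean).length } };
  }
  // XLSX, CSV, TXT, MD (and anything else) pass through as raw text.
  return { text: buffer.toString('utf8'), metadata: {} };
}
```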
Upload
Upload using the standard file endpoint. The server returns a storage path you reference in subsequent calls.
const fd = new FormData();
fd.append('file', pdfFile, 'report.pdf');
const { path, url, size } = await fetch('/data/upload', {
  method: 'POST',
  headers: { orgid: 'my-org', Authorization: `Bearer ${jwt}` },
  body: fd,
}).then(r => r.json());
The returned path is the in-bucket key you'll pass to the AI controller.
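The fields destructured above can be typed roughly as follows. This is an assumption based on the fields this guide uses; the endpoint may return additional fields.

```ts
// Assumed shape of the /data/upload response, based on the fields used above.
type UploadResponse = {
  path: string; // in-bucket key; pass as files[].path in AI requests
  url: string;  // URL of the stored object
  size: number; // size in bytes; pass as files[].size
};
```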
Reference in a chat request
await fetch('/ai/agent/stream', {
  method: 'POST',
  headers: {
    orgid: 'my-org',
    Authorization: `Bearer ${jwt}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    task: 'Summarise the key findings in this report and pull out any numbers above $1M.',
    files: [
      { path, name: 'report.pdf', size },
    ],
    agentRole: 'chat',
    conversationId: 'conv-1',
  }),
});
Inside the agent:
1. Fetch the file. DocumentProcessorService.processDocument(orgId, path, name) retrieves the buffer from the configured storage provider (S3, GCS, local).
2. Extract text. Routing is by extension: pdf-parse, mammoth, or the PPTX extractor. Returns { text, metadata }.
3. Build prompt context. The extracted text is prepended to the user's task as a context block (see the sketch below); long documents are chunked.
4. Stream the LLM call. The agent runs as normal. The document is referenced in conversation history, so follow-up questions about the same document don't need to re-upload.
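Conceptually, step 3 boils down to something like the sketch below. The function name and delimiters are illustrative; the actual prompt template lives inside the agent.

```ts
// Illustrative only: prepend the extracted text to the user's task as a context block.
function buildPromptContext(task: string, doc: { text: string; metadata: { fileName: string } }): string {
  return [
    `Document: ${doc.metadata.fileName}`,
    '--- document content ---',
    doc.text,
    '--- end document ---',
    '',
    task,
  ].join('\n');
}
```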
Streaming chunks
During processing the SSE stream emits document-specific chunks before the regular text response:
{ "type": "doc", "stage": "fetch", "fileName": "report.pdf" }
{ "type": "doc", "stage": "extract_started", "fileName": "report.pdf" }
{ "type": "doc", "stage": "extract_completed", "fileName": "report.pdf", "metadata": { "pageCount": 24, "wordCount": 6210 } }
{ "type": "text", "text": "The report identifies three key findings..." }
Use these to render an "Analysing document..." indicator before the actual answer starts.
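For example, a client might show the indicator on doc chunks and hide it as soon as the first text chunk arrives. The sketch below assumes the stream is standard SSE with one JSON object per data: line; showIndicator, hideIndicator, and appendToAnswer are hypothetical UI helpers.

```ts
// Sketch: drive a loading indicator from doc chunks, then render text chunks.
const res = await fetch('/ai/agent/stream', { /* request as above */ });
const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop() ?? ''; // keep any partial line for the next read
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const chunk = JSON.parse(line.slice('data: '.length));
    if (chunk.type === 'doc') {
      showIndicator(`Analysing ${chunk.fileName}...`); // hypothetical UI helper
    } else if (chunk.type === 'text') {
      hideIndicator();                                 // hypothetical UI helper
      appendToAnswer(chunk.text);                      // hypothetical UI helper
    }
  }
}
```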
Returned metadata
type DocumentProcessingResult = {
  success: boolean;
  text: string;
  metadata: {
    fileName: string;
    fileType: string; // 'pdf' | 'docx' | 'pptx' | ...
    fileSize: number;
    pageCount?: number;
    slideCount?: number;
    wordCount: number;
    extractedAt: string;
  };
  error?: string;
};
The metadata is included in the tool_result chunk if you call DocumentProcessorService directly through MCP. For a chat agent run, the metadata is summarised in a doc chunk and used to size the chunking strategy.
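If you consume the stream in TypeScript, a narrow type for the doc chunk might look like this. It is inferred from the examples above, not a published type.

```ts
// Assumed shape of a doc chunk, inferred from the streaming examples above.
type DocChunk = {
  type: 'doc';
  stage: 'fetch' | 'extract_started' | 'extract_completed';
  fileName: string;
  metadata?: DocumentProcessingResult['metadata']; // present on extract_completed
};
```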
RAG pattern
For documents larger than the model's context window, AppEngine chunks the extracted text and runs a retrieval step before the main agent call. The default chunker:
- Splits the document into ~800-token windows with 100-token overlap.
- Embeds each chunk via the configured embeddings provider.
- Stores embeddings in the org's vector index (Redis-backed by default).
- At query time, retrieves the top-k chunks most similar to the user's task and prepends them as context.
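A rough sketch of the windowing step described in the list above, using characters as a stand-in for tokens (the real chunker is token-based; the sizes here mirror the stated defaults):

```ts
// Sketch: fixed-size windows with overlap. The real chunker counts tokens, not characters.
function chunkText(text: string, chunkSize = 800, overlap = 100): string[] {
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```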
Configure k and chunk size by passing settings.rag in the request:
await fetch('/ai/agent/stream', {
  method: 'POST',
  headers: { /* ... */ },
  body: JSON.stringify({
    task: 'What does the report say about Q3 revenue?',
    files: [{ path: 'uploads/report.pdf', name: 'report.pdf' }],
    settings: {
      rag: {
        chunkSize: 1200,
        chunkOverlap: 200,
        topK: 5,
      },
    },
  }),
});
For documents under the chunk threshold (~80% of context window), RAG is skipped and the full text is included.
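In other words, the decision is roughly the following. The 80% figure is the approximate threshold mentioned above; the token count and context-window size are inputs you would obtain elsewhere.

```ts
// Sketch: RAG only kicks in when the document would not comfortably fit in context.
function shouldUseRag(documentTokens: number, modelContextWindow: number): boolean {
  return documentTokens > 0.8 * modelContextWindow; // ~80% threshold from above
}
```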
Re-using a processed document
The first time a document is processed, the extracted text and embeddings are cached under the org's storage key. Subsequent runs that reference the same path skip extraction and re-use the cached chunks. Cache eviction is by file modification time — if the underlying file changes, the extraction re-runs.
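One way to picture that behaviour (an assumed scheme, not the exact implementation): key the cached extraction on the org, the storage path, and the file's last-modified time, so a modified file naturally misses the cache and is re-extracted.

```ts
// Assumed scheme: a cache key that changes whenever the underlying file changes.
function extractionCacheKey(orgId: string, path: string, lastModified: Date): string {
  return `${orgId}:${path}:${lastModified.getTime()}`;
}
```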
For long-running conversations referring back to a document many turns in:
// Turn 1: include files
{ task: 'Summarise...', files: [{ path: 'uploads/r.pdf', name: 'r.pdf' }], conversationId: 'c1' }
// Turn 2: don't repeat files; the agent's conversationHistory carries the doc reference
{ task: 'Now drill into Q3.', conversationId: 'c1' }
The agent uses the cached chunks for the second turn without re-fetching.
For images attached to a request (images: [...]), the flow is different: vision-capable models receive the image directly as part of the message content, with no extraction step. See Chat agent streaming.