Knowledge Base

The Knowledge Base lets you upload documents and query them via semantic search using RAG (Retrieval-Augmented Generation). Documents are chunked, embedded, and indexed for fast similarity search.

Uploading Documents

MCP Tool: upload_document (Profile: core)

upload_document({
filePath: "/path/to/document.pdf",
title: "Product Manual 2026",
description: "Optional description"
})

Supported formats: PDF, HTML (.html, .htm), plain text (.txt), Markdown (.md).

  • PDF: text is extracted automatically before chunking.
  • HTML: tags are stripped, preserving text structure. Navigation elements (<nav>, <header>, <footer>, <aside>, <svg>, <form>, <button>) and script/style blocks are removed.
  • Text/Markdown: chunked and indexed directly.

After upload, the document is automatically chunked and indexed in the background. Its status transitions from processing to ready.
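
A client that wants to search immediately after an upload may need to wait for background indexing to finish. The following is a minimal sketch of such a wait loop, not part of the Fyso API: `fetch_status` is a hypothetical callable that you supply (for example, a wrapper around get_document), and the timeout and polling interval are assumed values.

```python
import time
from typing import Callable


def wait_until_ready(
    fetch_status: Callable[[], str],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the document leaves the "processing" state.

    fetch_status is a hypothetical caller-supplied function returning the
    current status string ("processing", "ready", or "error").
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status != "processing":
            return status  # either "ready" or "error"
        if time.monotonic() >= deadline:
            raise TimeoutError("document still processing after timeout")
        sleep(interval_s)
```

Injecting `sleep` keeps the loop easy to test without real delays.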

URL Ingestion

You can ingest content from a URL. Fyso will download the page, extract clean text (stripping HTML navigation/chrome), and index it:

upload_document({
title: "Company Policy",
content: "https://example.com/policy",
source_type: "url"
})

REST API:

curl -X POST https://api.fyso.dev/api/knowledge/documents \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{
"title": "Company Policy",
"content": "https://example.com/policy",
"source_type": "url"
}'

  • Only text-based resources are supported (HTML, plain text, JSON, XML)
  • 15-second timeout on URL fetch
  • SSRF protection: private/internal IPs are blocked
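
Because private and internal addresses are rejected server-side, a client can save a round trip by screening obviously unusable URLs first. The sketch below is a client-side pre-check only, under stated assumptions: it is not Fyso's SSRF implementation, the function name is hypothetical, and the restriction to http/https schemes is an assumption. Hostnames that need DNS resolution are passed through, since only the server can judge what they resolve to.

```python
import ipaddress
from urllib.parse import urlparse


def is_obviously_blocked(url: str) -> bool:
    """Return True when a URL would clearly be refused for ingestion.

    Hypothetical helper: it only inspects the URL text and never resolves DNS.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):  # assumption: web URLs only
        return True
    host = parsed.hostname
    if host is None or host == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name; the server decides after resolution
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved
```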

Binary PDF Upload (REST API)

To upload a PDF file directly from your backend or CI pipeline, use the multipart endpoint:

POST /api/knowledge/documents/upload
Authorization: Bearer <token>
Content-Type: multipart/form-data

curl -X POST https://api.fyso.dev/api/knowledge/documents/upload \
-H "Authorization: Bearer <token>" \
-F "file=@/path/to/manual.pdf" \
-F "title=Product Manual 2026"

Field   Type     Required   Description
file    binary   Yes        PDF file (application/pdf only, max 20 MB)
title   string   No         Document title. Defaults to the filename.

Returns 201 on success with the document metadata.

Errors:

Code   Description
400    Missing file field or unsupported MIME type (only PDF accepted)
403    Plan document or storage limit reached

Plan limits

Plan   Documents   Storage
Free   10          5 MB
Pro    1,000       1 GB
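
The file and plan limits above can be checked locally before sending a multipart upload, which avoids a predictable 400 or 403. This is a minimal sketch under stated assumptions: the function and constant names are hypothetical, MB and GB are taken as binary units (1 MB = 1024 × 1024 bytes), and the caller is assumed to already know the tenant's current document count and storage use.

```python
from typing import List

MAX_PDF_BYTES = 20 * 1024 * 1024  # 20 MB per file (binary MB is an assumption)

# Plan limits from the table above; byte conversions are assumptions.
PLAN_LIMITS = {
    "free": {"documents": 10, "storage_bytes": 5 * 1024 * 1024},
    "pro": {"documents": 1000, "storage_bytes": 1024 ** 3},
}


def check_upload(
    plan: str,
    file_name: str,
    file_size: int,
    current_documents: int,
    current_storage: int,
) -> List[str]:
    """Return a list of reasons the upload would likely be rejected."""
    limits = PLAN_LIMITS[plan]
    problems = []
    if not file_name.lower().endswith(".pdf"):
        problems.append("only PDF files are accepted (likely 400)")
    if file_size > MAX_PDF_BYTES:
        problems.append("file exceeds the 20 MB limit")
    if current_documents + 1 > limits["documents"]:
        problems.append("plan document limit reached (likely 403)")
    if current_storage + file_size > limits["storage_bytes"]:
        problems.append("plan storage limit reached (likely 403)")
    return problems
```

An empty list means no problem was detected locally; the server remains the final authority.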

Searching Documents

MCP Tool: search_knowledge (Profile: core)

search_knowledge({
query: "How do I reset the device?",
limit: 5,
threshold: 0.3,
one_per_document: true
})

Returns matching chunks with source document, relevance score, and content excerpt. Every search is tracked for analytics (see Stats).

Parameters

Parameter          Type       Default    Description
query              string     required   Natural language search query
limit              number     10         Maximum results (max 50)
threshold          number     0.3        Minimum similarity score 0-1. Lower = more results
one_per_document   boolean    false      Return only the best chunk per document
document_ids       string[]   all        Restrict search to specific documents

Search Tips

Search works by meaning, not exact keywords. Instead of a single word like "price", try "what is the product price" or "information about pricing". The more you describe what you're looking for, the better the results.

REST API:

curl -X POST https://api.fyso.dev/api/knowledge/search \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{
"query": "How do I reset the device?",
"limit": 5,
"threshold": 0.3,
"one_per_document": true
}'
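
When the request body is assembled in code, the documented parameter ranges can be enforced before the call is made. The helper below is a sketch with a hypothetical name; the defaults mirror the parameter table, while the lower bound of 1 for `limit` and the rejection of empty queries are assumptions.

```python
from typing import Optional, Sequence


def build_search_payload(
    query: str,
    limit: int = 10,
    threshold: float = 0.3,
    one_per_document: bool = False,
    document_ids: Optional[Sequence[str]] = None,
) -> dict:
    """Build the JSON body for POST /api/knowledge/search."""
    if not query.strip():
        raise ValueError("query must not be empty")
    if not 1 <= limit <= 50:  # documented maximum is 50; minimum of 1 is assumed
        raise ValueError("limit must be between 1 and 50")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    payload = {
        "query": query,
        "limit": limit,
        "threshold": threshold,
        "one_per_document": one_per_document,
    }
    if document_ids:  # omit the field to search all documents
        payload["document_ids"] = list(document_ids)
    return payload
```

The returned dict can be serialized with `json.dumps` and sent with any HTTP client.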

Response:

{
"success": true,
"data": {
"results": [
{
"content": "To reset, hold the power button for 10 seconds...",
"score": 0.92,
"document": { "id": "...", "title": "Product Manual 2026", "source_type": "file" },
"chunk_index": 3,
"token_count": 145
}
],
"query_time_ms": 45
}
}
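
In a RAG flow, the returned chunks are usually stitched into a context string for the language model. The sketch below shows one way to do that with the response shape above. The function name, the `min_score` cutoff, and the `max_tokens` budget are illustrative assumptions, not Fyso defaults.

```python
def build_context(response: dict, min_score: float = 0.5, max_tokens: int = 1000) -> str:
    """Join the best chunks into a single prompt context string."""
    if not response.get("success"):
        return ""
    # Highest-scoring chunks first, so the budget is spent on the best matches.
    results = sorted(
        response["data"]["results"], key=lambda r: r["score"], reverse=True
    )
    parts = []
    used = 0
    for r in results:
        if r["score"] < min_score:
            continue
        if used + r["token_count"] > max_tokens:
            break  # stop once the token budget would be exceeded
        label = f'[{r["document"]["title"]} #{r["chunk_index"]}]'
        parts.append(f'{label} {r["content"]}')
        used += r["token_count"]
    return "\n\n".join(parts)
```

Labeling each chunk with its document title and chunk index lets the model, or a human reviewer, trace an answer back to its source.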

Listing Documents

MCP Tool: list_documents (Profile: core)

Lists all documents in the tenant with metadata (title, upload date, chunk count, indexing status).

Filter by status: GET /api/knowledge/documents?status=ready

Getting a Document

MCP Tool: get_document (Profile: core)

get_document({ documentId: "uuid" })

Returns document metadata, content, and a preview of the first 5 chunks.

Deleting Documents

MCP Tool: delete_document (Profile: advanced)

delete_document({ documentId: "uuid" })

Removes the document and all its indexed chunks. A knowledge_delete event is tracked for analytics.

Stats

MCP Tool: get_knowledge_stats (Profile: core)

Returns indexing statistics, search analytics, and embedding usage:

GET /api/knowledge/stats
{
"documents": {
"total": 42,
"ready": 40,
"processing": 1,
"error": 1
},
"chunks": {
"total": 1820,
"avg_per_document": 43
},
"tokens": {
"total": 218400,
"avg_per_chunk": 120
},
"storage_bytes": 4718592,
"by_type": {
"application/pdf": 30,
"text/html": 10,
"text/plain": 2
},
"search": {
"total_queries_30d": 156,
"avg_latency_ms": 52,
"avg_score": 0.84,
"zero_result_rate": 0.06,
"coverage_score": 0.94
},
"embedding_usage_30d": {
"search_tokens": 3200,
"ingest_tokens": 45000,
"total_tokens": 48200,
"total_ingests": 42,
"avg_ingest_ms": 1250
},
"top_documents": [
{ "id": "...", "title": "Product Manual 2026", "hit_count": 48 }
]
}

Stats fields

Field                               Description
search.total_queries_30d            Number of searches in the last 30 days
search.avg_score                    Average top relevance score
search.zero_result_rate             Fraction of queries that returned no results
search.coverage_score               Fraction of queries that returned at least one result
embedding_usage_30d.search_tokens   OpenAI embedding tokens used for search queries
embedding_usage_30d.ingest_tokens   OpenAI embedding tokens used for document ingestion
embedding_usage_30d.total_tokens    Total embedding tokens (search + ingest)
embedding_usage_30d.avg_ingest_ms   Average document processing time
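
Several of these fields are related by definition: coverage_score and zero_result_rate are complementary fractions, and total_tokens is the sum of search and ingest tokens. The sketch below derives a few monitoring figures from a stats response shaped like the sample above. The function name and the output keys are hypothetical, and the 0.01 tolerance is an assumption to allow for rounding.

```python
def summarize_stats(stats: dict) -> dict:
    """Derive simple health indicators from a /api/knowledge/stats response."""
    docs = stats["documents"]
    search = stats["search"]
    usage = stats["embedding_usage_30d"]
    return {
        # Share of documents that finished indexing.
        "ready_fraction": docs["ready"] / docs["total"] if docs["total"] else 0.0,
        # Approximate count of searches that returned nothing.
        "zero_result_queries": round(
            search["total_queries_30d"] * search["zero_result_rate"]
        ),
        # The two fractions should add up to 1 (within rounding).
        "coverage_consistent": abs(
            search["coverage_score"] + search["zero_result_rate"] - 1.0
        ) < 0.01,
        # total_tokens is documented as search + ingest.
        "tokens_consistent": usage["search_tokens"] + usage["ingest_tokens"]
        == usage["total_tokens"],
    }
```

With the sample response above, roughly 9 of the 156 queries in the period returned no results.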

Event Tracking

All knowledge base operations are tracked via events for analytics and billing:

Event              Tracked Data
knowledge_ingest   document_id, title, source_type, mime_type, original_size_bytes, chunk_count, total_tokens, embedding_tokens_used, processing_ms
knowledge_search   query, result_count, top_score, latency_ms, document_ids_hit, embedding_tokens_used
knowledge_delete   document_id, title, source_type, chunk_count, total_tokens, original_size_bytes

Storage Usage

To get a breakdown of knowledge base storage for monitoring or billing purposes:

GET /api/usage/storage
Authorization: Bearer <token>
{
"success": true,
"data": {
"db": {
"bytes": 8388608,
"table_count": 12,
"estimated_rows": 347
},
"knowledge_base": {
"bytes": 512000,
"documents": 3
},
"bucket": {
"bytes": 0,
"file_count": 0
},
"total_bytes": 8388608
}
}
  • db.bytes — total PostgreSQL storage for all tenant tables (exact)
  • db.estimated_rows — estimated row count from PostgreSQL statistics (approximate)
  • knowledge_base.bytes — sum of original file sizes for all documents
  • bucket — file storage used (stub; returns 0 in the current release)
  • total_bytes — db + bucket (knowledge_base not included in total)
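
Since total_bytes leaves out the knowledge base, a monitoring script that wants a combined figure has to add it explicitly. The sketch below works on the `data` object of the response above. The function name and output keys are hypothetical, and it assumes knowledge_base.bytes does not overlap with db.bytes, which the source does not state.

```python
def storage_summary(data: dict) -> dict:
    """Summarize the `data` object returned by GET /api/usage/storage."""
    db = data["db"]["bytes"]
    kb = data["knowledge_base"]["bytes"]
    bucket = data["bucket"]["bytes"]
    return {
        "reported_total": data["total_bytes"],
        # Should match reported_total, which is documented as db + bucket.
        "db_plus_bucket": db + bucket,
        # Assumption: knowledge base bytes are separate from db bytes.
        "including_knowledge_base": db + bucket + kb,
    }
```

For the sample response, `db_plus_bucket` equals the reported 8388608 bytes, and the combined figure is 8900608 bytes.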

Dashboard

From the admin panel, go to Knowledge in the sidebar to manage your knowledge base visually:

  • Stats bar — document count, storage used, total chunks
  • Document list — PDF badge for PDF files, status badge (ready/processing/error), file size, content preview, delete button
  • Add document panel — text tab (title + content), URL tab (fetches and indexes the page), or file upload tab (PDF)
  • Search panel — enter a query, adjust precision slider, toggle fragments/one-per-doc, see results with certainty bar
  • Help modal — explains search options and how to search effectively
  • Usage page — storage breakdown by file type (PDF, Text, Markdown, HTML)

Use Cases

  • Support chatbots: Index FAQ documents, answer user questions with search_knowledge
  • Internal wikis: Upload policies and procedures, let agents surface relevant content
  • Product documentation: Augment business rules with external knowledge
  • Web content: Ingest pages via URL, automatically cleaned of navigation/chrome