Upload files, ask questions, and get AI-backed answers with citations. Compare, synthesize, and export results for team use and polished deliverables.

At a Glance:

Corpus is an open-source document Q&A tool that lets users upload PDFs and web pages, ask questions in natural language, and receive answers with source citations through an AI-powered retrieval pipeline.

Overview:

Corpus is a document Q&A application powered by large language models. It allows users to upload PDFs and web pages, organize them into workspaces and document sets, and then ask questions in natural language against their uploaded content. Answers include citations that link directly back to the source text. The project provides a FastAPI backend, a React frontend built with Vite, and relies on PostgreSQL, Elasticsearch, RabbitMQ, Temporal, and Redis for its processing pipeline. It supports multiple LLM providers including OpenAI, Anthropic, Google, and xAI, as well as embedding providers from OpenAI and Voyage AI.

Key Decision Points:

  • Self-managed stack: The architecture requires PostgreSQL, Elasticsearch, RabbitMQ, Temporal, and Redis to run, making it suitable for teams comfortable operating multiple infrastructure components.

  • Document and web ingestion: Users can upload PDFs and web pages directly, with no mention of other file types or data sources in the README.

  • Citation-backed answers: Answers are explicitly linked to source text through citations, which supports verification workflows.

  • Multi-model LLM support: The tool is not locked to a single provider and can be configured with OpenAI, Anthropic, Google, or xAI models based on user preference or access.
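
To make the multi-provider point concrete, here is a minimal sketch of how provider selection could be dispatched. It is not Corpus's actual code or configuration API: `ModelChoice`, `GENERATORS`, and `generate_answer` are hypothetical names, and the stub generators stand in for real provider SDK calls so the example runs without network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict


# Hypothetical type: a provider name plus an operator-supplied model name.
@dataclass(frozen=True)
class ModelChoice:
    provider: str  # e.g. "openai", "anthropic", "google", "xai"
    model: str     # provider-specific model identifier


def _stub(provider: str) -> Callable[[str, str], str]:
    """Return a fake generator; a real one would call the provider's SDK."""
    def generate(model: str, prompt: str) -> str:
        return f"[{provider}/{model}] answer to: {prompt}"
    return generate


# Registry mapping provider names to callables of the form (model, prompt) -> text.
GENERATORS: Dict[str, Callable[[str, str], str]] = {
    name: _stub(name) for name in ("openai", "anthropic", "google", "xai")
}


def generate_answer(choice: ModelChoice, prompt: str) -> str:
    """Dispatch the prompt to the generator registered for the chosen provider."""
    try:
        generate = GENERATORS[choice.provider]
    except KeyError:
        raise ValueError(f"unsupported provider: {choice.provider!r}") from None
    return generate(choice.model, prompt)
```

With a registry like this, switching providers is a configuration change (a different `ModelChoice`) and not a code change, which is the practical meaning of not being locked to a single vendor.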

Core Features:

  • Multi-format document upload: Supports uploading PDFs and web pages for content extraction and indexing.

  • Natural language Q&A: Users can ask questions in plain language and receive AI-generated answers based on uploaded documents.

  • Source citations: Every answer includes citations that link directly back to the original source text for fact-checking and traceability.

  • Workspaces and document sets: Documents can be organized into separate workspaces and grouped into document sets for structured access.

  • Pluggable LLM and embedding providers: Ships with support for multiple model providers including OpenAI, Anthropic, Google, and xAI for text generation, and OpenAI and Voyage AI for embeddings.
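
The citation feature above can be pictured as an answer object that carries pointers back into the source text. The sketch below is an illustration under stated assumptions, not Corpus's data model: `Citation`, `Answer`, and `resolve` are hypothetical names, character offsets are one possible way to address a passage, and the sample document text is made up.

```python
from dataclasses import dataclass
from typing import Dict, List


# Hypothetical shape: a citation points at a character span in one document.
@dataclass(frozen=True)
class Citation:
    document_id: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


@dataclass(frozen=True)
class Answer:
    text: str
    citations: List[Citation]


def resolve(citation: Citation, documents: Dict[str, str]) -> str:
    """Return the exact source passage that a citation points to."""
    return documents[citation.document_id][citation.start:citation.end]


# Made-up sample data, used only to exercise the structure.
documents = {"report.pdf": "Revenue grew 12% in 2023. Costs were flat."}
quote = "Revenue grew 12% in 2023."
start = documents["report.pdf"].index(quote)

answer = Answer(
    text="Revenue rose 12% in 2023.",
    citations=[Citation("report.pdf", start, start + len(quote))],
)

# A reader (or a UI) can check each claim against the cited passage.
for c in answer.citations:
    print(f"{c.document_id}[{c.start}:{c.end}] -> {resolve(c, documents)}")
```

Storing the span alongside the answer is what makes fact-checking cheap: verification is a lookup, not a second search.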

Use Cases:

  • Document analyst Q&A: An analyst can upload a set of PDF reports and web articles, then ask specific questions about their content and receive cited answers without reading each document in full.

  • Research material retrieval: A researcher can organize papers and web sources into themed document sets, then query across them to locate specific claims or data points with source verification.

Open-Source Alternative Value:

Corpus provides a self-managed alternative to closed-source document Q&A services, with full visibility into its processing architecture and model routing. Users can choose which LLM and embedding providers to use instead of being bound to a single vendor's model stack. The component-based architecture using FastAPI, PostgreSQL, Elasticsearch, RabbitMQ, Temporal, and Redis means operators can inspect and tune each part of the retrieval and answer generation pipeline.
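
Because the stack depends on five backing services, an operator may want a preflight check before starting the application. The following is a minimal sketch of that idea only. The environment variable names in `REQUIRED_SETTINGS` are assumptions chosen for illustration, not settings documented by Corpus, and `missing_services` is a hypothetical helper.

```python
import os
from typing import List, Mapping

# Assumed variable names, one per backing service named in the description.
# A real deployment should use whatever names the project's own docs specify.
REQUIRED_SETTINGS = {
    "PostgreSQL": "DATABASE_URL",
    "Elasticsearch": "ELASTICSEARCH_URL",
    "RabbitMQ": "RABBITMQ_URL",
    "Temporal": "TEMPORAL_ADDRESS",
    "Redis": "REDIS_URL",
}


def missing_services(env: Mapping[str, str]) -> List[str]:
    """List the services whose connection setting is absent or empty."""
    return [
        service
        for service, variable in REQUIRED_SETTINGS.items()
        if not env.get(variable)
    ]


if __name__ == "__main__":
    missing = missing_services(os.environ)
    if missing:
        print("Missing configuration for: " + ", ".join(missing))
    else:
        print("All five backing services are configured.")
```

A check like this only confirms that settings are present; it does not verify that each service is reachable or healthy.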

Project Statistics:

  • Stars: 13

  • Forks: 0

  • License: AGPL-3.0

Metadata:

  • Alternative to: Hebbia