CruxGen
Create LLM-ready datasets. Straight from the source.
What is CruxGen?
CruxGen automates the creation of synthetic question-answer datasets from your company's unstructured documents. Upload PDFs and DOCX files, and let AI generate contextually relevant Q&A pairs ready for LLM training.
Problem: Creating training datasets manually is time-intensive and expensive.
Solution: Automated pipeline that processes documents and generates structured Q&A pairs.
Key Features
- Document Processing: Upload and manage PDF/DOCX files
- Intelligent Chunking: Automatically split documents into meaningful segments
- QA Generation: Generate contextual question-answer pairs using an LLM
- Export Ready: Download datasets in JSONL format for training
- Enterprise Ready: Vault integration, PostgreSQL storage, MinIO object storage
Quick Start
1. Install the SDK
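The package name below is an assumption; if the SDK is not published to PyPI, install it from this repository instead (see Installation).

```bash
# Assumed PyPI package name; otherwise install from source with `pip install -e .`
pip install cruxgen-sdk
```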
2. Process Your First Document
```python
from cruxgen_sdk import CruxGenSDK

with CruxGenSDK("http://localhost:8000") as sdk:
    # Upload document
    result = sdk.upload_file("company-policy.pdf")
    file_id = result["response"]

    # Create chunks
    sdk.create_chunks(file_id, "default-bucket")

    # Generate QA pairs
    sdk.create_qa_pairs(file_id)

    # Export dataset
    qa_data = sdk.get_qa_pairs(file_id, generate_jsonl=True)
    with open("training_data.jsonl", "wb") as f:
        f.write(qa_data)
```
3. Use Your Dataset
The generated JSONL file contains structured Q&A pairs ready for LLM training:
{"question": "What is the company's remote work policy?", "answer": "Employees may work remotely up to 3 days per week..."}
{"question": "How do I submit vacation requests?", "answer": "Vacation requests must be submitted through the HR portal..."}
Architecture
CruxGen follows a modular pipeline architecture:
- Document Storage → MinIO object storage
- Document Chunking → Docling text splitters
- QA Generation → OpenAI API processing (see the sketch after this list)
- Data Management → PostgreSQL with SQLAlchemy
- API Layer → FastAPI with automatic documentation
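To make the QA Generation stage concrete, here is a rough sketch of turning one chunk into a Q&A pair with LiteLLM. The prompt, model name, and JSON handling are illustrative assumptions, not CruxGen's actual implementation.

```python
import json

import litellm

def generate_qa_pair(chunk_text: str, model: str = "gpt-4o-mini") -> dict:
    # Model name and prompt are placeholders; CruxGen's real prompt and settings may differ.
    response = litellm.completion(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "Write one JSON object with 'question' and 'answer' keys, "
                           "grounded only in the user-provided text.",
            },
            {"role": "user", "content": chunk_text},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```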
API Components
Core Services
- Main Application - FastAPI server with health checks
- Document Management - File upload, storage, and metadata
- Chunk Management - Document splitting and chunk operations
- QA Management - Question-answer pair generation
Client SDK
- Python SDK - Complete SDK for CruxGen API
Dependencies
Core Stack:
- FastAPI 0.116.1+ (API framework)
- SQLAlchemy 2.0.43+ (database ORM)
- MinIO 7.2.16+ (object storage)
- PostgreSQL (primary database)
LLM Processing:
- LiteLLM (LLM API management)
Infrastructure:
- HashiCorp Vault (secrets management)
- Tenacity 9.1.2+ (retry logic)
- Python 3.11+ required
Getting Started
Prerequisites
- Database: PostgreSQL instance
- Object Storage: MinIO server
- Secrets: HashiCorp Vault
- LLM API: OpenAI API key
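For local development, one way to stand up these backing services is with Docker. The images, ports, and credentials below are throwaway placeholders for a dev environment, not the settings CruxGen ships with.

```bash
# PostgreSQL (primary database)
docker run -d --name cruxgen-postgres -e POSTGRES_PASSWORD=postgres -p 5432:5432 postgres:16

# MinIO (object storage), API on :9000 and console on :9001
docker run -d --name cruxgen-minio -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin -e MINIO_ROOT_PASSWORD=minioadmin \
  minio/minio server /data --console-address ":9001"

# HashiCorp Vault in dev mode (in-memory, local use only)
docker run -d --name cruxgen-vault --cap-add=IPC_LOCK \
  -e VAULT_DEV_ROOT_TOKEN_ID=root -p 8200:8200 hashicorp/vault
```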
Installation
```bash
# Clone repository
git clone https://github.com/your-org/cruxgen
cd cruxgen

# Install dependencies
pip install -e .

# Configure environment
cp .env.example .env
# Edit .env with your credentials

# Start server
python main.py
```
Health Check
Verify all systems are operational once the server is running.
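The health endpoint path below is an assumption; FastAPI's auto-generated docs at http://localhost:8000/docs list the actual routes.

```bash
# Assumed health endpoint on the default local server from the Quick Start
curl http://localhost:8000/health
```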
Workflow
Document to Dataset Pipeline
1. Upload documents (PDF/DOCX) to MinIO storage
2. Process files into database-tracked chunks
3. Generate Q&A pairs using LLM processing
4. Export structured datasets for training
Typical Usage Pattern
```python
# 1. Document Management
sdk.create_bucket("documents")
result = sdk.upload_file("manual.pdf", "documents")
file_id = result["response"]

# 2. Content Processing
sdk.create_chunks(file_id, "documents")
chunks = sdk.get_chunks(file_id)

# 3. Dataset Generation
sdk.create_qa_pairs(file_id)
dataset = sdk.get_qa_pairs(file_id, generate_jsonl=True)

# 4. Export and Use
with open("training_set.jsonl", "wb") as f:
    f.write(dataset)
```