No description
- Employment Rights Act 1996 (1007 KB)
- Patents Act 1977 (445 KB)
- Trade Secrets Enforcement Regulations 2018 (18 KB)
- Social Security Intermediaries Regulations 2000 (19 KB)
- Database Rights Regulations 1997 (18 KB)
Integration includes:
* Copied source files to raw/uk/{employment,ip,tax}/
* Added rows to manifests/download_manifest.csv with SHA-256 hashes
* Updated logs/download_log.csv with ingestion timestamps
* Ingested documents into Chroma vector database (5 docs -> 28 vectors)
* Updated ROADMAP.md with UK completion statistics
* Updated README.md with setup, ingestion, and structure documentation
All UK P0 documents now complete for freelancer contract analysis.
Chroma collection vectors increased from 5,262 to 5,290.
|
||
|---|---|---|
| indexes/chromadb | ||
| logs | ||
| manifests | ||
| raw | ||
| scripts | ||
| CLOUD_SESSION_LEGAL_DB_BUILD.md | ||
| LEGAL_CORPUS_IMPORT_LIST.md | ||
| README.md | ||
| requirements.txt | ||
| ROADMAP.md | ||
if-legal-corpus
Self-hosted legal document corpus for contract analysis and freelancer rights research. Contains international legislation, case law, and industry standards indexed in ChromaDB for semantic search.
Quick Start
Setup
# Clone repository (already done)
cd /home/setup/if-legal-corpus
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Running Downloads
To download and process documents from the inventory:
# Download all documents from LEGAL_CORPUS_IMPORT_LIST.md
python scripts/download_all.py
# This will:
# 1. Read LEGAL_CORPUS_IMPORT_LIST.md
# 2. Download each document to raw/ subdirectories
# 3. Log downloads to logs/download_log.csv
# 4. Record manifest entries in manifests/download_manifest.csv
Running Chroma Ingestion
To index downloaded documents into ChromaDB for semantic search:
# Ingest documents into Chroma vector database
python scripts/ingest_chromadb.py \
--manifest manifests/download_manifest.csv \
--db-dir indexes/chromadb
# This will:
# 1. Read all documents with status=success from manifest
# 2. Extract text (supports .md, .pdf, .html, .json, .xml)
# 3. Chunk text (1500 chars per chunk, 200 char overlap)
# 4. Generate embeddings (using Chroma's default embedding)
# 5. Store vectors with metadata in indexes/chromadb/
Corpus Structure
Raw Documents (raw/)
raw/
├── uk/ # United Kingdom legislation (P0: COMPLETE)
│ ├── employment/ # Employment law
│ │ └── Employment_Rights_Act_1996.md
│ ├── ip/ # Intellectual property
│ │ ├── Patents_Act_1977.md
│ │ ├── Trade_Secrets_Enforcement_Regulations_2018.md
│ │ └── Copyright_Rights_Databases_Regulations_1997.md
│ └── tax/ # Tax & IR35 regulations
│ └── Social_Security_Intermediaries_Regulations_2000.md
├── us_federal/ # US federal statutes and regulations
├── us_state/ # State-level law (CA, NY, TX, etc.)
├── eu/ # EU directives and regulations
├── germany/ # German civil code (BGB)
├── france/ # French labor and IP law
├── canada/ # Canadian federal and provincial law
├── australia/ # Australian legislation
├── datasets/ # Pre-labeled contract datasets (CUAD, etc.)
├── caselaw/ # Landmark cases (CourtListener, etc.)
└── industry/ # Industry standards (AIGA, GAG, IGDA, etc.)
Manifest & Logs (manifests/ and logs/)
- manifests/download_manifest.csv - Complete inventory with download status, file sizes, and SHA-256 hashes
- logs/download_log.csv - Timestamped download operations
Vector Database (indexes/chromadb/)
- Chroma persistent storage - 5,290 vectors from all ingested documents
- Collection name:
if_legal_corpus - Metadata per vector:
inventory_path,document_name,local_path,sha256
UK P0 Documents - Integration Complete
All 5 critical UK documents for freelancer contract analysis have been integrated (2025-11-28):
| Document | Category | Size | Vectors | Status |
|---|---|---|---|---|
| Employment Rights Act 1996 | Employment law | 1,031 KB | 6 | ✓ Complete |
| Patents Act 1977 | IP law | 455 KB | 3 | ✓ Complete |
| Trade Secrets Enforcement Regulations 2018 | IP law | 18 KB | 1 | ✓ Complete |
| Social Security (Intermediaries) Regulations 2000 | IR35 Tax | 19 KB | 1 | ✓ Complete |
| Database Rights Regulations 1997 | IP law | 18 KB | 1 | ✓ Complete |
Total UK collection: 12 documents, 12+ vectors, 16 MB
Corpus Coverage Summary
- Total Inventory Items: 153
- Successfully Downloaded: 91 (59%)
- Download Errors: 45 (29%)
- No Direct Link: 17 (11%)
- Chroma Vectors: 5,290
- Raw Size: ~16 MB
Important Notes
Chroma Database Size
The ChromaDB indexes are large and should not be committed to git:
- Add
indexes/chromadb/to.gitignore - Regenerate vectors locally using
scripts/ingest_chromadb.py
Virtual Environment
The .venv/ directory is excluded from git. Install dependencies locally after cloning.
Document Formats Supported
- Markdown (
.md) - PDF (
.pdf) - HTML (
.html) - XML (
.xml) - JSON (
.json) - Plain text
Chunking Strategy
- Default chunk size: 1,500 characters
- Overlap: 200 characters
- Configurable in
scripts/ingest_chromadb.pyviachunk_text()function
License & Attribution
This corpus includes:
- Public domain legislation from UK, US, EU, Canada, Australia
- Creative commons licensed industry standards (AIGA, GAG, IGDA)
- Dataset acknowledgments: CUAD, ContractNLI, LEDGAR
See individual source documents for their specific licensing terms.