codex-bot 57ad645661 feat: add 5 critical UK P0 legal documents to corpus
- Employment Rights Act 1996 (1007 KB)
- Patents Act 1977 (445 KB)
- Trade Secrets Enforcement Regulations 2018 (18 KB)
- Social Security Intermediaries Regulations 2000 (19 KB)
- Database Rights Regulations 1997 (18 KB)

Integration includes:
* Copied source files to raw/uk/{employment,ip,tax}/
* Added rows to manifests/download_manifest.csv with SHA-256 hashes
* Updated logs/download_log.csv with ingestion timestamps
* Ingested documents into Chroma vector database (5 docs -> 28 vectors)
* Updated ROADMAP.md with UK completion statistics
* Updated README.md with setup, ingestion, and structure documentation

All UK P0 documents now complete for freelancer contract analysis.
Chroma collection vectors increased from 5,262 to 5,290.
2025-11-28 04:11:52 +01:00
Path                            | Last commit                                          | Date
--------------------------------|------------------------------------------------------|---------------------------
indexes/chromadb                | feat: add legal corpus downloader and initial data   | 2025-11-28 01:43:50 +01:00
logs                            | feat: add 5 critical UK P0 legal documents to corpus | 2025-11-28 04:11:52 +01:00
manifests                       | feat: add 5 critical UK P0 legal documents to corpus | 2025-11-28 04:11:52 +01:00
raw                             | feat: add legal corpus downloader and initial data   | 2025-11-28 01:43:50 +01:00
scripts                         | feat: add legal corpus downloader and initial data   | 2025-11-28 01:43:50 +01:00
CLOUD_SESSION_LEGAL_DB_BUILD.md | feat: add legal corpus downloader and initial data   | 2025-11-28 01:43:50 +01:00
LEGAL_CORPUS_IMPORT_LIST.md     | feat: add legal corpus downloader and initial data   | 2025-11-28 01:43:50 +01:00
README.md                       | feat: add 5 critical UK P0 legal documents to corpus | 2025-11-28 04:11:52 +01:00
requirements.txt                | feat: add legal corpus downloader and initial data   | 2025-11-28 01:43:50 +01:00
ROADMAP.md                      | feat: add 5 critical UK P0 legal documents to corpus | 2025-11-28 04:11:52 +01:00

if-legal-corpus

Self-hosted legal document corpus for contract analysis and freelancer rights research. Contains international legislation, case law, and industry standards indexed in ChromaDB for semantic search.

Quick Start

Setup

# Clone repository (already done)
cd /home/setup/if-legal-corpus

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Running Downloads

To download and process documents from the inventory:

# Download all documents from LEGAL_CORPUS_IMPORT_LIST.md
python scripts/download_all.py

# This will:
# 1. Read LEGAL_CORPUS_IMPORT_LIST.md
# 2. Download each document to raw/ subdirectories
# 3. Log downloads to logs/download_log.csv
# 4. Record manifest entries in manifests/download_manifest.csv
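
Steps 3 and 4 come down to appending one row per download to each CSV. The following is a minimal sketch of that bookkeeping, not the actual contents of scripts/download_all.py: the helper names (sha256_of, append_row, record_download) and the column names (size_bytes, timestamp) are assumptions made for illustration.

```python
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 64 KiB blocks so large statutes need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def append_row(csv_path: Path, row: dict) -> None:
    """Append one row, writing the header first if the file is new or empty."""
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not csv_path.exists() or csv_path.stat().st_size == 0
    with csv_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)


def record_download(local_path: Path, manifest: Path, log: Path) -> dict:
    """Record a finished download in the manifest and the log (hypothetical columns)."""
    entry = {
        "local_path": str(local_path),
        "status": "success",
        "size_bytes": local_path.stat().st_size,
        "sha256": sha256_of(local_path),
    }
    append_row(manifest, entry)
    append_row(log, {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "local_path": str(local_path),
        "status": "success",
    })
    return entry
```

Storing the hash next to the path makes it possible to detect later whether a file under raw/ has changed since it was downloaded.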

Running Chroma Ingestion

To index downloaded documents into ChromaDB for semantic search:

# Ingest documents into Chroma vector database
python scripts/ingest_chromadb.py \
  --manifest manifests/download_manifest.csv \
  --db-dir indexes/chromadb

# This will:
# 1. Read all documents with status=success from manifest
# 2. Extract text (supports .md, .pdf, .html, .json, .xml)
# 3. Chunk text (1500 chars per chunk, 200 char overlap)
# 4. Generate embeddings (using Chroma's default embedding)
# 5. Store vectors with metadata in indexes/chromadb/
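
Step 1 can be pictured as a filter over the manifest CSV. This sketch assumes the manifest has a column literally named status; the function name successful_rows is hypothetical and is not taken from scripts/ingest_chromadb.py.

```python
import csv
from pathlib import Path


def successful_rows(manifest_path):
    """Yield manifest rows whose status column equals 'success'."""
    with Path(manifest_path).open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Tolerate missing values, stray whitespace, and case differences.
            if (row.get("status") or "").strip().lower() == "success":
                yield row
```

Rows with any other status (for example failed downloads) are skipped, so only files that exist locally reach the extraction step.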

Corpus Structure

Raw Documents (raw/)

raw/
├── uk/                    # United Kingdom legislation (P0: COMPLETE)
│   ├── employment/        # Employment law
│   │   └── Employment_Rights_Act_1996.md
│   ├── ip/                # Intellectual property
│   │   ├── Patents_Act_1977.md
│   │   ├── Trade_Secrets_Enforcement_Regulations_2018.md
│   │   └── Copyright_Rights_Databases_Regulations_1997.md
│   └── tax/               # Tax & IR35 regulations
│       └── Social_Security_Intermediaries_Regulations_2000.md
├── us_federal/            # US federal statutes and regulations
├── us_state/              # State-level law (CA, NY, TX, etc.)
├── eu/                    # EU directives and regulations
├── germany/               # German civil code (BGB)
├── france/                # French labor and IP law
├── canada/                # Canadian federal and provincial law
├── australia/             # Australian legislation
├── datasets/              # Pre-labeled contract datasets (CUAD, etc.)
├── caselaw/               # Landmark cases (CourtListener, etc.)
└── industry/              # Industry standards (AIGA, GAG, IGDA, etc.)

Manifest & Logs (manifests/ and logs/)

  • manifests/download_manifest.csv - Complete inventory with download status, file sizes, and SHA-256 hashes
  • logs/download_log.csv - Timestamped download operations

Vector Database (indexes/chromadb/)

  • Chroma persistent storage - 5,290 vectors from all ingested documents
  • Collection name: if_legal_corpus
  • Metadata per vector: inventory_path, document_name, local_path, sha256
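
Once the collection exists, it can be queried from Python. The sketch below assumes the chromadb package is installed and that the collection was built with Chroma's default embedding, as described above; the function names search and format_hits are illustrative and not part of this repository.

```python
def search(query, n_results=5, db_dir="indexes/chromadb",
           collection_name="if_legal_corpus"):
    """Run a semantic query against the persisted collection."""
    import chromadb  # imported lazily so format_hits() works without the package

    client = chromadb.PersistentClient(path=db_dir)
    collection = client.get_collection(collection_name)
    return collection.query(query_texts=[query], n_results=n_results)


def format_hits(results):
    """Flatten the first query's hits into (document_name, distance, snippet) tuples."""
    documents = results["documents"][0]
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]
    return [
        (meta.get("document_name", "?"), round(dist, 3), doc[:80])
        for doc, meta, dist in zip(documents, metadatas, distances)
    ]
```

A call such as format_hits(search("ownership of inventions made by contractors")) would then list the closest chunks with the document each one came from; smaller distances mean closer matches.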

UK P0 Documents - Integration Complete

All 5 critical UK documents for freelancer contract analysis have been integrated (2025-11-28):

Document                                          | Category       | Size     | Vectors | Status
--------------------------------------------------|----------------|----------|---------|-----------
Employment Rights Act 1996                        | Employment law | 1,031 KB | 6       | ✓ Complete
Patents Act 1977                                  | IP law         | 455 KB   | 3       | ✓ Complete
Trade Secrets Enforcement Regulations 2018        | IP law         | 18 KB    | 1       | ✓ Complete
Social Security (Intermediaries) Regulations 2000 | IR35 Tax       | 19 KB    | 1       | ✓ Complete
Database Rights Regulations 1997                  | IP law         | 18 KB    | 1       | ✓ Complete

Total UK collection: 12 documents, 12+ vectors, 16 MB

Corpus Coverage Summary

  • Total Inventory Items: 153
  • Successfully Downloaded: 91 (59%)
  • Download Errors: 45 (29%)
  • No Direct Link: 17 (11%)
  • Chroma Vectors: 5,290
  • Raw Size: ~16 MB
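
The percentages in this summary can be rechecked from the counts. This is only an arithmetic check using the figures listed above; it does not read the manifest.

```python
# Counts copied from the summary list; labels are used only for display.
counts = {
    "Successfully Downloaded": 91,
    "Download Errors": 45,
    "No Direct Link": 17,
}
total = sum(counts.values())  # 153, matching Total Inventory Items

# Share of the inventory per outcome, rounded to whole percent.
shares = {label: round(100 * n / total) for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label}: {n}/{total} = {shares[label]}%")
```

Because each share is rounded separately, the three percentages add up to 99 rather than 100.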

Important Notes

Chroma Database Size

The ChromaDB indexes are large and should not be committed to git:

  • Add indexes/chromadb/ to .gitignore
  • Regenerate vectors locally using scripts/ingest_chromadb.py

Virtual Environment

The .venv/ directory is excluded from git. Install dependencies locally after cloning.

Document Formats Supported

  • Markdown (.md)
  • PDF (.pdf)
  • HTML (.html)
  • XML (.xml)
  • JSON (.json)
  • Plain text
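
Text extraction for these formats can be dispatched on the file extension. The sketch below uses only the Python standard library and is an assumption about how such a dispatcher might look, not the code in scripts/ingest_chromadb.py; extract_text and _TextCollector are hypothetical names, and PDF is left unimplemented because it needs a third-party parser.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from pathlib import Path


class _TextCollector(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # nesting depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def extract_text(path):
    """Return the plain text of a document, chosen by file extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        raise NotImplementedError("PDF needs a third-party parser; not covered here")
    if suffix == ".xml":
        # Parse from the file so any encoding declaration is honoured.
        root = ET.parse(path).getroot()
        return "\n".join(t.strip() for t in root.itertext() if t.strip())
    raw = path.read_text(encoding="utf-8", errors="replace")
    if suffix == ".html":
        collector = _TextCollector()
        collector.feed(raw)
        collector.close()
        return "\n".join(collector.parts)
    if suffix == ".json":
        # Re-serialise so the chunker sees stable, readable text.
        return json.dumps(json.loads(raw), ensure_ascii=False, indent=2)
    return raw  # .md and plain text are used as-is
```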

Chunking Strategy

  • Default chunk size: 1,500 characters
  • Overlap: 200 characters
  • Configurable in scripts/ingest_chromadb.py via the chunk_text() function
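
A sliding window with these defaults can be sketched as follows. The name chunk_text matches the function mentioned above, but the signature and body here are assumptions; the version in scripts/ingest_chromadb.py may differ.

```python
def chunk_text(text, chunk_size=1500, overlap=200):
    """Split text into chunks of chunk_size characters.

    Each chunk after the first starts `overlap` characters before the end
    of the previous one, so neighbouring chunks share that much context.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # 1,300 characters with the defaults
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks
```

With the defaults, a 3,000-character document yields three chunks starting at offsets 0, 1,300, and 2,600. The overlap keeps a sentence that straddles a boundary intact in at least one chunk.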

License & Attribution

This corpus includes:

  • Public domain legislation from the UK, US, EU, Canada, and Australia
  • Creative Commons-licensed industry standards (AIGA, GAG, IGDA)
  • Dataset acknowledgments: CUAD, ContractNLI, LEDGAR

See individual source documents for their specific licensing terms.