Domain-Specific Embedding Model for Washington State Law
Washington-state-law-embedding-model-Large
A large sentence embedding model fine-tuned specifically for the Revised Code of Washington (RCW). Based on BAAI/bge-large-en-v1.5, it maps legal queries and statutory text to a 1024-dimensional dense vector space, bridging the gap between conversational "street-language" queries and formal statutory text.
Performance Analysis
This section details the critical retrieval metrics comparing the untrained BAAI/bge-large-en-v1.5 base model against our fine-tuned custom model. Evaluated against a held-out validation set of ~47,000 unique Washington State law queries, fine-tuning reached the maximum retrieval score attainable on this dataset, successfully mapping natural-language queries to their governing statutes.
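The retrieval metrics in the comparison below can be computed with a short evaluation loop. Here is a minimal sketch, assuming each query has a single gold RCW id and a ranked list of retrieved candidates; the function names and toy data are illustrative, not from the actual evaluation harness:

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold statute appears in the top k results."""
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)

def mean_reciprocal_rank(ranked_ids, gold_ids):
    """Mean of 1/rank of the gold statute (0 contribution if not retrieved)."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_ids)

# Toy data: three queries, each with a ranked list of candidate RCW ids
ranked = [["9A.56.030", "46.61.502"],
          ["46.61.502", "9A.36.011"],
          ["9A.36.011", "9A.56.030"]]
gold = ["9A.56.030", "9A.36.011", "9A.36.011"]

print(recall_at_k(ranked, gold, 1))        # 2 of 3 queries correct at rank 1
print(mean_reciprocal_rank(ranked, gold))  # averages 1, 1/2, 1
```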
Retrieval Metrics Comparison
Edge-Case Deep Dive
General accuracy is important, but a law enforcement tool must perform reliably under highly specific, granular search conditions. We isolated complex query types (specific legal terminology, procedural how-to questions, and broad statute lookups) to verify the model's robustness.
Legal Task (specific_term)
Successfully handles complex statutory terminology with high fidelity across the 1024-dimensional space.
Procedure/How-To (specific_task)
Translates actionable, task-oriented questions to relevant RCWs effectively.
General Search (statute_lookup)
Maintains an excellent baseline for broad, topical inquiries about state laws.
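A per-category breakdown like the one above can be produced by tagging each evaluation record with its query type and aggregating hit rates. This is a sketch with a hypothetical record structure and toy data, not the project's actual evaluation code:

```python
from collections import defaultdict

# Hypothetical evaluation records: each carries the query-type tag used above
# (specific_term, specific_task, statute_lookup) and whether retrieval hit.
records = [
    {"query_type": "specific_term",  "hit": True},
    {"query_type": "specific_task",  "hit": True},
    {"query_type": "specific_task",  "hit": False},
    {"query_type": "statute_lookup", "hit": True},
]

# Group hit flags by query type
by_type = defaultdict(list)
for rec in records:
    by_type[rec["query_type"]].append(rec["hit"])

# Report per-category accuracy
for query_type, hits in sorted(by_type.items()):
    print(f"{query_type}: {sum(hits) / len(hits):.2f}")
```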
Implementation Guide
Integrating this custom model into your Python stack is straightforward via the sentence-transformers library. However, strict adherence to the query-formatting rules below is critical to achieving the benchmarked accuracy.
⚠ CRITICAL INSTRUCTION
Because this model inherits the BGE-Large architecture, you MUST prepend a specific instruction to user search queries.
- Queries: Prepend "Represent this sentence for searching relevant passages: "
- Database Docs: Do NOT add the prefix. Embed raw text only.
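This asymmetric prefix rule can be enforced with a small helper so the instruction is never accidentally applied to documents. The helper name and constant below are illustrative; the prefix string itself is the one this model requires:

```python
# The instruction prefix required by the BGE-Large architecture for QUERIES only
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def format_for_embedding(text: str, is_query: bool) -> str:
    """Prepend the BGE instruction for search queries; leave documents raw."""
    return BGE_QUERY_PREFIX + text if is_query else text

print(format_for_embedding("theft over five thousand dollars", is_query=True))
print(format_for_embedding("RCW 9A.56.030: Theft in the first degree...", is_query=False))
```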
Dependencies
pip install -U sentence-transformers
```python
import torch
from sentence_transformers import SentenceTransformer, util

# 1. Load the fine-tuned large model
model = SentenceTransformer('CSI-lab/Washington-state-law-embedding-model-Large')

# 2. Define the laws (your vector database)
laws = [
    "RCW 9A.56.030: Theft in the first degree. A person is guilty of theft in the first degree if he or she commits theft of property or services which exceed(s) five thousand dollars in value.",
    "RCW 46.61.502: Driving under the influence. A person is guilty of driving while under the influence of intoxicating liquor...",
    "RCW 9A.36.011: Assault in the first degree. A person is guilty of assault in the first degree if he or she..."
]

# 3. Define the user's search query
user_query = "What dollar amount makes a theft a first degree felony?"

# 4. CRITICAL: add the required BGE prefix to the query ONLY
query_prefix = "Represent this sentence for searching relevant passages: "
formatted_query = query_prefix + user_query

# 5. Encode the documents and the query
law_embeddings = model.encode(laws, convert_to_tensor=True)
query_embedding = model.encode(formatted_query, convert_to_tensor=True)

# 6. Calculate cosine similarity
cosine_scores = util.cos_sim(query_embedding, law_embeddings)

# 7. Print the top result
best_idx = cosine_scores.argmax().item()
print(f"Top Match: {laws[best_idx]}")
print(f"Similarity Score: {cosine_scores[0][best_idx]:.4f}")
```
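For reference, the cosine similarity that util.cos_sim computes in step 6 reduces to a dot product divided by the product of L2 norms. Here is a pure-Python sketch on toy 4-dimensional vectors (the real model produces 1024-dimensional embeddings, and util.cos_sim does this work on tensors):

```python
import math

def cos_sim(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for a query embedding and two law embeddings
query_vec = [0.1, 0.9, 0.0, 0.2]
law_vecs = [[0.1, 0.8, 0.1, 0.3],   # nearly parallel to the query
            [0.9, 0.0, 0.4, 0.1]]   # nearly orthogonal to the query

scores = [cos_sim(query_vec, v) for v in law_vecs]
best_idx = max(range(len(scores)), key=scores.__getitem__)
print(best_idx)  # the first toy "law" is the closer match
```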