Domain-Specific Embedding Model for Washington State Law

Washington-state-law-embedding-model-Large

An advanced, parameter-rich sentence embedding model fine-tuned specifically for the Revised Code of Washington (RCW). Built on BAAI/bge-large-en-v1.5, it maps legal queries and statutory text into a 1024-dimensional dense vector space, bridging the gap between conversational, "street-language" queries and formal statutory text.

  • 504.1k RCW context pairs
  • 1024 vector dimensions
  • 83.5% Accuracy@10
  • +58.5% MRR improvement

Performance Analysis

This section details the key retrieval metrics comparing the off-the-shelf BAAI/bge-large-en-v1.5 base model against our fine-tuned custom model. Evaluated against a held-out validation set of ~47,000 unique Washington State law queries, fine-tuning pushes retrieval performance toward the practical ceiling for this legal dataset, successfully mapping natural-language queries to their governing statutes.

Retrieval Metrics Comparison

  • Fine-Tuned Recall@10: 83.54%
  • Fine-Tuned Recall@5: 42.55%
  • Fine-Tuned MRR@10: 24.87%
  • Fine-Tuned NDCG@10: 38.28%

Edge-Case Deep Dive

General accuracy is important, but a law-enforcement tool must perform reliably under highly specific, granular search conditions. We isolated complex query types (such as procedure-focused questions and dense legal terminology) to verify the model's robustness; a sketch for reproducing this per-category breakdown follows the categories below.

🔍 Legal Term (specific_term)

Successfully handles complex statutory terminology with high fidelity across the 1024-dimensional space.

Procedure/How-To (specific_task)

Translates actionable, task-oriented questions to relevant RCWs effectively.

📖 General Search (statute_lookup)

Maintains an excellent baseline for broad, topical inquiries about state laws.
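
If your validation pairs carry a query-type label matching the categories above, a per-category Recall@10 breakdown takes only a few lines. The pairs, labels, and truncated statute texts below are hypothetical placeholders for illustration:

import torch
from collections import defaultdict
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('CSI-lab/Washington-state-law-embedding-model-Large')
prefix = "Represent this sentence for searching relevant passages: "

# Hypothetical labeled validation pairs: (query, gold RCW id, query_type)
pairs = [
    ("What dollar amount makes a theft a first degree felony?", "RCW 9A.56.030", "specific_term"),
    ("How do I charge someone with first degree assault?", "RCW 9A.36.011", "specific_task"),
    ("Washington laws about drunk driving", "RCW 46.61.502", "statute_lookup"),
]
corpus_ids = ["RCW 9A.56.030", "RCW 9A.36.011", "RCW 46.61.502"]
corpus_texts = ["Theft in the first degree. ...", "Assault in the first degree. ...",
                "Driving under the influence. ..."]
corpus_emb = model.encode(corpus_texts, convert_to_tensor=True)

hits, totals = defaultdict(int), defaultdict(int)
for query, gold_id, qtype in pairs:
    q_emb = model.encode(prefix + query, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, corpus_emb)[0]
    top_k = torch.topk(scores, k=min(10, len(corpus_ids))).indices.tolist()
    hits[qtype] += int(corpus_ids.index(gold_id) in top_k)
    totals[qtype] += 1

for qtype in totals:
    print(f"{qtype}: Recall@10 = {hits[qtype] / totals[qtype]:.2%}")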

Implementation Guide

Integrating this custom model into your Python stack is straightforward via the sentence-transformers library. However, strict adherence to the query formatting rules below is critical to achieving the benchmarked accuracy.

CRITICAL INSTRUCTION

Because this model is fine-tuned from BGE-Large, you MUST prepend its retrieval instruction to user search queries.

  • Queries: Prepend
    "Represent this sentence for searching relevant passages: "
  • Database Docs: Do NOT add the prefix. Embed raw text only.
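
Once sentence-transformers is installed (see Dependencies below), one way to keep this rule from being forgotten is to wrap both sides in small helpers, as sketched here; the helper names are ours, not part of the library:

from sentence_transformers import SentenceTransformer

BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "
model = SentenceTransformer('CSI-lab/Washington-state-law-embedding-model-Large')

def encode_queries(queries, **kwargs):
    """Encode user search queries WITH the mandatory BGE instruction prefix."""
    return model.encode([BGE_QUERY_PREFIX + q for q in queries], **kwargs)

def encode_documents(docs, **kwargs):
    """Encode statute/passage text WITHOUT any prefix."""
    return model.encode(docs, **kwargs)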

Dependencies

pip install -U sentence-transformers
Python Usage Example
import torch
from sentence_transformers import SentenceTransformer, util

# 1. Load the fine-tuned large model
model = SentenceTransformer('CSI-lab/Washington-state-law-embedding-model-Large')

# 2. Define the laws (Your Vector Database)
laws = [
    "RCW 9A.56.030: Theft in the first degree. A person is guilty of theft in the first degree if he or she commits theft of property or services which exceed(s) five thousand dollars in value.",
    "RCW 46.61.502: Driving under the influence. A person is guilty of driving while under the influence of intoxicating liquor...",
    "RCW 9A.36.011: Assault in the first degree. A person is guilty of assault in the first degree if he or she..."
]

# 3. Define the user's search query
user_query = "What dollar amount makes a theft a first degree felony?"

# 4. CRITICAL: Add the required BGE prefix to the query ONLY
query_prefix = "Represent this sentence for searching relevant passages: "
formatted_query = query_prefix + user_query

# 5. Encode the documents and the query
law_embeddings = model.encode(laws, convert_to_tensor=True)
query_embedding = model.encode(formatted_query, convert_to_tensor=True)

# 6. Calculate Cosine Similarity
cosine_scores = util.cos_sim(query_embedding, law_embeddings)

# 7. Print the top result
best_idx = cosine_scores.argmax().item()
print(f"Top Match: {laws[best_idx]}")
print(f"Similarity Score: {cosine_scores[0][best_idx]:.4f}")