Domain-Specific Embedding Model for Washington State Law
Washington-state-law-embedding-model-Base
An advanced sentence embedding model fine-tuned specifically for the Revised Code of Washington (RCW). It maps legal queries and statutory text to a 768-dimensional dense vector space, bridging the gap between conversational "street-language" queries and formal statutory text.
Performance Analysis
This section details the key retrieval metrics comparing the original BAAI/bge-base-en-v1.5 model against our fine-tuned custom model. Evaluated on a held-out validation set of ~47,000 unique Washington State law queries, fine-tuning produced substantial improvements in recall and ranking accuracy, making the model well suited as the backbone for legal RAG pipelines.
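For readers unfamiliar with the metrics behind this comparison, the sketch below shows how Recall@k and MRR (mean reciprocal rank) are typically computed over ranked retrieval results. The records here are illustrative toy data, not the actual ~47k-query validation set.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant document appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mrr(ranked_ids, relevant_id):
    """Reciprocal rank of the relevant document (0.0 if absent)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

# Toy eval set: (doc ids ranked by the model, correct doc id per query)
results = [
    (["rcw_9a.56.030", "rcw_9a.56.040"], "rcw_9a.56.030"),  # hit at rank 1
    (["rcw_46.61.500", "rcw_46.61.502"], "rcw_46.61.502"),  # hit at rank 2
]
avg_recall_at_1 = sum(recall_at_k(r, g, 1) for r, g in results) / len(results)
avg_mrr = sum(mrr(r, g) for r, g in results) / len(results)
print(avg_recall_at_1, avg_mrr)  # 0.5 0.75
```

Averaging these per-query scores over the validation set yields the aggregate numbers reported in the comparison.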
Retrieval Metrics Comparison
Edge-Case Deep Dive
General accuracy is important, but a law enforcement tool must perform reliably under highly specific, granular search conditions. We isolated complex query types (like specific procedures or dense legal tasks) to verify the model's robustness.
Legal Task (specific_term)
Successfully handles complex statutory terminology with 94.87% accuracy.
Procedure/How-To (specific_task)
Translates actionable, task-oriented questions to relevant RCWs with 94.74% accuracy.
General Search (statute_lookup)
Maintains a 90.00% baseline for broad, topical inquiries about state laws.
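A per-query-type breakdown like the one above can be produced by tagging each evaluation query with a category and averaging top-1 hits per category. This is a minimal sketch with made-up records, not the real evaluation data or harness.

```python
from collections import defaultdict

def accuracy_by_category(records):
    """records: iterable of (category, predicted_doc_id, gold_doc_id) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, predicted, gold in records:
        totals[category] += 1
        if predicted == gold:
            hits[category] += 1
    return {c: hits[c] / totals[c] for c in totals}

# Illustrative records only -- categories mirror the ones reported above
records = [
    ("specific_term", "rcw_1", "rcw_1"),
    ("specific_term", "rcw_2", "rcw_9"),
    ("statute_lookup", "rcw_3", "rcw_3"),
]
print(accuracy_by_category(records))  # {'specific_term': 0.5, 'statute_lookup': 1.0}
```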
Implementation Guide
Integrating this custom model into your Python stack is straightforward via the sentence-transformers library. However, strict adherence to the query-formatting rules below is critical to achieving the benchmarked accuracy.
⚠ CRITICAL INSTRUCTION
Because this model inherits the BGE architecture, you MUST prepend a specific instruction to user search queries.
- Queries: Prepend "Represent this sentence for searching relevant passages: "
- Database Docs: Do NOT add the prefix. Embed raw text only.
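The asymmetric-prefix rule above is easy to get wrong in application code, so a small helper can enforce it in one place. `format_query` and `format_document` are names we made up for illustration; they are not part of sentence-transformers.

```python
# The BGE retrieval instruction required for queries (and queries only)
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def format_query(query: str) -> str:
    """Prepend the BGE instruction exactly once (idempotent)."""
    if query.startswith(BGE_QUERY_PREFIX):
        return query
    return BGE_QUERY_PREFIX + query

def format_document(doc: str) -> str:
    """Documents stored in the vector database are embedded raw -- no prefix."""
    return doc

print(format_query("What dollar amount makes a theft a first degree felony?"))
```

Routing every query through a helper like this keeps the prefix out of stored document embeddings, where it would silently degrade retrieval quality.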
Dependencies
pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer, util

# 1. Load the fine-tuned model
model = SentenceTransformer('./washington-state-law-embedding-model')

# 2. Define the laws (your vector database)
laws = [
    "RCW 9A.56.030: Theft in the first degree. A person is guilty of theft in the first degree if he or she commits theft of property or services which exceed(s) five thousand dollars in value.",
    "RCW 46.61.502: Driving under the influence. A person is guilty of driving while under the influence of intoxicating liquor...",
    "RCW 9A.36.011: Assault in the first degree. A person is guilty of assault in the first degree if he or she..."
]

# 3. Define the user's search query
user_query = "What dollar amount makes a theft a first degree felony?"

# 4. CRITICAL: Add the required BGE prefix to the query ONLY
query_prefix = "Represent this sentence for searching relevant passages: "
formatted_query = query_prefix + user_query

# 5. Encode the documents and the query
law_embeddings = model.encode(laws, convert_to_tensor=True)
query_embedding = model.encode(formatted_query, convert_to_tensor=True)

# 6. Calculate cosine similarity
cosine_scores = util.cos_sim(query_embedding, law_embeddings)

# 7. Print the top result
best_idx = cosine_scores.argmax().item()
print(f"Top Match: {laws[best_idx]}")
print(f"Similarity Score: {cosine_scores[0][best_idx]:.4f}")
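For a database larger than a few statutes you typically want the top-k matches rather than a single argmax; sentence-transformers provides `util.semantic_search` for exactly that. The sketch below shows the underlying ranking step in plain Python (the same cosine quantity `util.cos_sim` computes), using illustrative 3-dimensional vectors rather than real 768-dimensional model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors -- what util.cos_sim computes pairwise."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_emb, doc_embs, k):
    """Indices of the k most similar documents, best first."""
    scores = [(i, cosine_similarity(query_emb, e)) for i, e in enumerate(doc_embs)]
    return [i for i, _ in sorted(scores, key=lambda t: t[1], reverse=True)[:k]]

# Toy embeddings (a real encode() call returns 768-d vectors)
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query = [1.0, 0.1, 0.0]
print(top_k(query, docs, 2))  # [2, 0]
```

In production, prefer the library helpers (`util.cos_sim` or `util.semantic_search`) over a hand-rolled loop; they are batched and run on tensors.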