Domain-Specific Embedding Model for Washington State Law
Washington-state-law-embedding-model-Base
An advanced sentence embedding model fine-tuned specifically for the Revised Code of Washington (RCW). It maps legal queries and statutory text to a 768-dimensional dense vector space, bridging the gap between conversational "street-language" queries and formal statutory text.
Performance Analysis
This section details the key retrieval metrics comparing the original BAAI/bge-base-en-v1.5 model against our fine-tuned custom model. Evaluated on a held-out validation set of ~47,000 unique Washington State law queries, fine-tuning produced substantial improvements in recall and ranking accuracy, making the model well suited as the backbone for legal RAG pipelines.
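For readers unfamiliar with the metrics behind this comparison, the sketch below shows how Recall@k and MRR (mean reciprocal rank) are typically computed over ranked retrieval results. The records here are illustrative toy data, not the actual ~47k-query validation set.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant document appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mrr(ranked_ids, relevant_id):
    """Reciprocal rank of the relevant document (0.0 if absent)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

# Toy eval set: (doc ids ranked by the model, correct doc id per query)
results = [
    (["rcw_9a.56.030", "rcw_9a.56.040"], "rcw_9a.56.030"),  # hit at rank 1
    (["rcw_46.61.500", "rcw_46.61.502"], "rcw_46.61.502"),  # hit at rank 2
]
avg_recall_at_1 = sum(recall_at_k(r, g, 1) for r, g in results) / len(results)
avg_mrr = sum(mrr(r, g) for r, g in results) / len(results)
print(avg_recall_at_1, avg_mrr)  # 0.5 0.75
```

Averaging these per-query scores over the validation set yields the aggregate numbers reported in the comparison.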
Retrieval Metrics Comparison
Edge-Case Deep Dive
General accuracy is important, but a law enforcement tool must perform reliably under highly specific, granular search conditions. We isolated complex query types (like specific procedures or dense legal tasks) to verify the model's robustness.
Legal Task (specific_term)
Successfully handles complex statutory terminology with 94.87% accuracy.
Procedure/How-To (specific_task)
Translates actionable, task-oriented questions to relevant RCWs with 94.74% accuracy.
General Search (statute_lookup)
Maintains a 90.00% baseline for broad, topical inquiries about state laws.
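A per-query-type breakdown like the one above can be produced by tagging each evaluation query with a category and averaging top-1 hits per category. This is a minimal sketch with made-up records, not the real evaluation data or harness.

```python
from collections import defaultdict

def accuracy_by_category(records):
    """records: iterable of (category, predicted_doc_id, gold_doc_id) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, predicted, gold in records:
        totals[category] += 1
        if predicted == gold:
            hits[category] += 1
    return {c: hits[c] / totals[c] for c in totals}

# Illustrative records only -- categories mirror the ones reported above
records = [
    ("specific_term", "rcw_1", "rcw_1"),
    ("specific_term", "rcw_2", "rcw_9"),
    ("statute_lookup", "rcw_3", "rcw_3"),
]
print(accuracy_by_category(records))  # {'specific_term': 0.5, 'statute_lookup': 1.0}
```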
Implementation Guide
Integrating this custom model into your Python stack is straightforward via the sentence-transformers library. However, strict adherence to the query-formatting rules below is critical to achieving the benchmarked accuracy.
⚠ CRITICAL INSTRUCTION
Because this model inherits the BGE architecture, you MUST prepend a specific instruction to user search queries.
- Queries: Prepend "Represent this sentence for searching relevant passages: "
- Database Docs: Do NOT add the prefix. Embed raw text only.
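The asymmetric-prefix rule above is easy to get wrong in application code, so a small helper can enforce it in one place. `format_query` and `format_document` are names we made up for illustration; they are not part of sentence-transformers.

```python
# The BGE retrieval instruction required for queries (and queries only)
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def format_query(query: str) -> str:
    """Prepend the BGE instruction exactly once (idempotent)."""
    if query.startswith(BGE_QUERY_PREFIX):
        return query
    return BGE_QUERY_PREFIX + query

def format_document(doc: str) -> str:
    """Documents stored in the vector database are embedded raw -- no prefix."""
    return doc

print(format_query("What dollar amount makes a theft a first degree felony?"))
```

Routing every query through a helper like this keeps the prefix out of stored document embeddings, where it would silently degrade retrieval quality.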
Dependencies
pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer, util

# 1. Load the fine-tuned model
model = SentenceTransformer('./washington-state-law-embedding-model')

# 2. Define the laws (your vector database)
laws = [
    "RCW 9A.56.030: Theft in the first degree. A person is guilty of theft in the first degree if he or she commits theft of property or services which exceed(s) five thousand dollars in value.",
    "RCW 46.61.502: Driving under the influence. A person is guilty of driving while under the influence of intoxicating liquor...",
    "RCW 9A.36.011: Assault in the first degree. A person is guilty of assault in the first degree if he or she..."
]

# 3. Define the user's search query
user_query = "What dollar amount makes a theft a first degree felony?"

# 4. CRITICAL: Add the required BGE prefix to the query ONLY
query_prefix = "Represent this sentence for searching relevant passages: "
formatted_query = query_prefix + user_query

# 5. Encode the documents and the query
law_embeddings = model.encode(laws, convert_to_tensor=True)
query_embedding = model.encode(formatted_query, convert_to_tensor=True)

# 6. Calculate cosine similarity
cosine_scores = util.cos_sim(query_embedding, law_embeddings)

# 7. Print the top result
best_idx = cosine_scores.argmax().item()
print(f"Top Match: {laws[best_idx]}")
print(f"Similarity Score: {cosine_scores[0][best_idx]:.4f}")
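For a database larger than a few statutes you typically want the top-k matches rather than a single argmax; sentence-transformers provides `util.semantic_search` for exactly that. The sketch below shows the underlying ranking step in plain Python (the same cosine quantity `util.cos_sim` computes), using illustrative 3-dimensional vectors rather than real 768-dimensional model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors -- what util.cos_sim computes pairwise."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_emb, doc_embs, k):
    """Indices of the k most similar documents, best first."""
    scores = [(i, cosine_similarity(query_emb, e)) for i, e in enumerate(doc_embs)]
    return [i for i, _ in sorted(scores, key=lambda t: t[1], reverse=True)[:k]]

# Toy embeddings (a real encode() call returns 768-d vectors)
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query = [1.0, 0.1, 0.0]
print(top_k(query, docs, 2))  # [2, 0]
```

In production, prefer the library helpers (`util.cos_sim` or `util.semantic_search`) over a hand-rolled loop; they are batched and run on tensors.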