Domain-Specific Embedding Model for Washington State Law
Washington-state-law-embedding-model-Large
A large sentence embedding model fine-tuned specifically for the Revised Code of Washington (RCW). Based on BAAI/bge-large-en-v1.5, it maps legal queries and statutory text to a 1024-dimensional dense vector space, bridging the gap between conversational "street-language" queries and formal statutory text.
Performance Analysis
This section details the critical retrieval metrics comparing the untrained BAAI/bge-large-en-v1.5 base model against our fine-tuned custom model. Evaluated against a held-out validation set of ~47,000 unique Washington State law queries, fine-tuning reached the maximum retrieval score attainable on this dataset, successfully mapping natural-language queries to their governing statutes.
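The retrieval metrics in the comparison below can be computed with a short evaluation loop. Here is a minimal sketch, assuming each query has a single gold RCW id and a ranked list of retrieved candidates; the function names and toy data are illustrative, not from the actual evaluation harness:

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold statute appears in the top k results."""
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)

def mean_reciprocal_rank(ranked_ids, gold_ids):
    """Mean of 1/rank of the gold statute (0 contribution if not retrieved)."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_ids)

# Toy data: three queries, each with a ranked list of candidate RCW ids
ranked = [["9A.56.030", "46.61.502"],
          ["46.61.502", "9A.36.011"],
          ["9A.36.011", "9A.56.030"]]
gold = ["9A.56.030", "9A.36.011", "9A.36.011"]

print(recall_at_k(ranked, gold, 1))        # 2 of 3 queries correct at rank 1
print(mean_reciprocal_rank(ranked, gold))  # averages 1, 1/2, 1
```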
Retrieval Metrics Comparison
Edge-Case Deep Dive
General accuracy is important, but a law enforcement tool must perform reliably under highly specific, granular search conditions. We isolated complex query types (specific legal terminology, procedural how-to questions, and broad statute lookups) to verify the model's robustness.
Legal Task (specific_term)
Successfully handles complex statutory terminology with high fidelity across the 1024-dimensional space.
Procedure/How-To (specific_task)
Translates actionable, task-oriented questions to relevant RCWs effectively.
General Search (statute_lookup)
Maintains an excellent baseline for broad, topical inquiries about state laws.
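A per-category breakdown like the one above can be produced by tagging each evaluation record with its query type and aggregating hit rates. This is a sketch with a hypothetical record structure and toy data, not the project's actual evaluation code:

```python
from collections import defaultdict

# Hypothetical evaluation records: each carries the query-type tag used above
# (specific_term, specific_task, statute_lookup) and whether retrieval hit.
records = [
    {"query_type": "specific_term",  "hit": True},
    {"query_type": "specific_task",  "hit": True},
    {"query_type": "specific_task",  "hit": False},
    {"query_type": "statute_lookup", "hit": True},
]

# Group hit flags by query type
by_type = defaultdict(list)
for rec in records:
    by_type[rec["query_type"]].append(rec["hit"])

# Report per-category accuracy
for query_type, hits in sorted(by_type.items()):
    print(f"{query_type}: {sum(hits) / len(hits):.2f}")
```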
Implementation Guide
Integrating this custom model into your Python stack is straightforward via the sentence-transformers library. However, strict adherence to the query-formatting rules below is critical to achieving the benchmarked accuracy.
⚠ CRITICAL INSTRUCTION
Because this model inherits the BGE-Large architecture, you MUST prepend a specific instruction to user search queries.
- Queries: Prepend "Represent this sentence for searching relevant passages: "
- Database Docs: Do NOT add the prefix. Embed raw text only.
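This asymmetric prefix rule can be enforced with a small helper so the instruction is never accidentally applied to documents. The helper name and constant below are illustrative; the prefix string itself is the one this model requires:

```python
# The instruction prefix required by the BGE-Large architecture for QUERIES only
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def format_for_embedding(text: str, is_query: bool) -> str:
    """Prepend the BGE instruction for search queries; leave documents raw."""
    return BGE_QUERY_PREFIX + text if is_query else text

print(format_for_embedding("theft over five thousand dollars", is_query=True))
print(format_for_embedding("RCW 9A.56.030: Theft in the first degree...", is_query=False))
```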
Dependencies
pip install -U sentence-transformers
```python
import torch
from sentence_transformers import SentenceTransformer, util

# 1. Load the fine-tuned large model
model = SentenceTransformer('CSI-lab/Washington-state-law-embedding-model-Large')

# 2. Define the laws (your vector database)
laws = [
    "RCW 9A.56.030: Theft in the first degree. A person is guilty of theft in the first degree if he or she commits theft of property or services which exceed(s) five thousand dollars in value.",
    "RCW 46.61.502: Driving under the influence. A person is guilty of driving while under the influence of intoxicating liquor...",
    "RCW 9A.36.011: Assault in the first degree. A person is guilty of assault in the first degree if he or she..."
]

# 3. Define the user's search query
user_query = "What dollar amount makes a theft a first degree felony?"

# 4. CRITICAL: add the required BGE prefix to the query ONLY
query_prefix = "Represent this sentence for searching relevant passages: "
formatted_query = query_prefix + user_query

# 5. Encode the documents and the query
law_embeddings = model.encode(laws, convert_to_tensor=True)
query_embedding = model.encode(formatted_query, convert_to_tensor=True)

# 6. Calculate cosine similarity
cosine_scores = util.cos_sim(query_embedding, law_embeddings)

# 7. Print the top result
best_idx = cosine_scores.argmax().item()
print(f"Top Match: {laws[best_idx]}")
print(f"Similarity Score: {cosine_scores[0][best_idx]:.4f}")
```
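For reference, the cosine similarity that util.cos_sim computes in step 6 reduces to a dot product divided by the product of L2 norms. Here is a pure-Python sketch on toy 4-dimensional vectors (the real model produces 1024-dimensional embeddings, and util.cos_sim does this work on tensors):

```python
import math

def cos_sim(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for a query embedding and two law embeddings
query_vec = [0.1, 0.9, 0.0, 0.2]
law_vecs = [[0.1, 0.8, 0.1, 0.3],   # nearly parallel to the query
            [0.9, 0.0, 0.4, 0.1]]   # nearly orthogonal to the query

scores = [cos_sim(query_vec, v) for v in law_vecs]
best_idx = max(range(len(scores)), key=scores.__getitem__)
print(best_idx)  # the first toy "law" is the closer match
```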