There is established know-how for exactly this kind of problem.
What matters most in your case
Your queries are short (4–5 tokens) but dense with constraints:
- Numbers + units: 12 kVA, 5 hp
- Counts: 3 axis, 3 phase
- Alphanumerics: model/SKU/standards (common in B2B)
Dense embeddings are excellent for semantic proximity (“generator” ↔ “genset”, “industrial backup” ↔ “standby”), but they are often unreliable at enforcing numeric precision and exact spec constraints in a cosine-similarity-only setup. Empirical work on embeddings and numeric detail shows this is a consistent failure mode. (Hugging Face)
For that reason, the best-performing B2B search stacks treat embeddings as one signal in reranking, not the judge of truth for numbers/units/codes.
Target behavior: “spec correctness” first, semantics second
For B2B, success is typically:
- Retrieve the right product type (generator, compressor, milling machine)
- Ensure the top results satisfy the hard constraints (kVA/HP/axis/phase)
- Use semantics to break ties (application, cooling type, brand preference, etc.)
This informs how you should preprocess attributes and how you should score candidates.
Step 1 — Normalize the catalog attributes before embedding
Attribute embeddings become dramatically more stable once the underlying attributes are consistent.
1) Canonicalize attribute keys (schema consolidation)
Make a canonical key set and map synonyms into it:
power, power_rating, rated_power → power_rating
phases, phase_count → phase
cooling, cooling_type → cooling_type
Store both the raw and the canonical key (key_raw, key_canonical).
This reduces fragmentation in both lexical and semantic matching.
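A minimal mapping sketch (the synonym table and helper name are illustrative, not from your schema):

```python
# Hand-maintained synonym table: raw attribute keys -> canonical keys.
KEY_SYNONYMS = {
    "power": "power_rating",
    "rated_power": "power_rating",
    "phases": "phase",
    "phase_count": "phase",
    "cooling": "cooling_type",
}

def canonicalize_keys(attributes: dict) -> dict:
    """Map raw attribute keys onto the canonical key set; keep unknown keys as-is."""
    canonical = {}
    for raw_key, value in attributes.items():
        key = raw_key.strip().lower()
        canonical[KEY_SYNONYMS.get(key, key)] = value
    return canonical

# canonicalize_keys({"rated_power": "12 kVA", "phases": "3"})
# -> {"power_rating": "12 kVA", "phase": "3"}
```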
2) Normalize numeric values + units (raw + canonical)
For each numeric attribute, store both representations:
value_raw_text: "12 kva"
value_number: 12
unit_raw: "kva"
unit_canonical: "kVA"
value_canonical_base: 12000
unit_base: "VA"
This enables:
- exact/range matching
- unit conversion (kVA ↔ VA, HP ↔ W, inch ↔ mm)
- controlled tolerances (±5%, bucket ranges)
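A minimal parsing sketch, assuming a small hand-written unit table; in practice a units library (e.g., pint) can replace the conversion factors:

```python
import re

# Illustrative subset: unit -> (canonical spelling, base unit, factor to base).
UNIT_TABLE = {
    "kva": ("kVA", "VA", 1000.0),
    "va":  ("VA",  "VA", 1.0),
    "hp":  ("hp",  "W",  745.7),
    "kw":  ("kW",  "W",  1000.0),
}

NUM_UNIT_RE = re.compile(r"(?P<num>\d+(?:[.,]\d+)?)\s*(?P<unit>[a-zA-Z]+)")

def normalize_numeric(value_raw_text: str) -> dict | None:
    """Parse '12 kva' into raw + canonical representations; None if unparseable."""
    match = NUM_UNIT_RE.search(value_raw_text)
    if not match or match.group("unit").lower() not in UNIT_TABLE:
        return None
    number = float(match.group("num").replace(",", "."))
    unit_raw = match.group("unit").lower()
    unit_canonical, unit_base, factor = UNIT_TABLE[unit_raw]
    return {
        "value_raw_text": value_raw_text,
        "value_number": number,
        "unit_raw": unit_raw,
        "unit_canonical": unit_canonical,
        "value_canonical_base": number * factor,
        "unit_base": unit_base,
    }

# normalize_numeric("12 kva")
# -> {"value_number": 12.0, "value_canonical_base": 12000.0, "unit_base": "VA", ...}
```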
3) Normalize categorical attributes into controlled vocab
Example:
fuel_type: diesel / gasoline / natural_gas / LPG
cooling_type: air_cooled / water_cooled / oil_cooled
Store:
value_raw
value_canonical
4) Normalize codes (alphanumerics) into multiple searchable forms
For MPNs, SKUs, standards, thread sizes, etc.:
AB-1234 → AB1234, AB 1234
M12x1.75 → M12 x 1.75, M12x1.75
These fields usually need strong lexical treatment (exact/partial/regex/character n-grams). Dense embeddings should not be your only tool here.
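A minimal variant-generation sketch for the lexical index (the rules are illustrative; tune them per code family):

```python
import re

def code_variants(code: str) -> set[str]:
    """Generate searchable spellings of an alphanumeric code for lexical matching."""
    base = code.strip()
    variants = {base, base.upper()}
    variants.add(re.sub(r"[-_ ]", "", base))              # AB-1234  -> AB1234
    variants.add(re.sub(r"[-_]", " ", base))              # AB-1234  -> AB 1234
    variants.add(re.sub(r"(?<=\w)x(?=\d)", " x ", base))  # M12x1.75 -> M12 x 1.75
    return variants

# code_variants("AB-1234")  -> {"AB-1234", "AB1234", "AB 1234"}
# code_variants("M12x1.75") -> includes "M12 x 1.75" alongside the raw form
```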
Step 2 — Build the attribute text that you embed (the best “view” for spec queries)
You listed 4 serialization options. For your query style, the best default is:
Recommended default: Line-separated key: value (option #3)
Example “spec view” for the generator:
product_type: diesel generator
power_rating: 12 kVA
power_rating_va: 12000 VA
fuel_type: diesel
phase: 3
cooling_type: air cooled
application: industrial backup
Why it works well:
- Keys disambiguate values (especially numbers like 3).
- Newlines preserve boundaries cleanly (avoids attribute “bleeding” that happens with flat concatenation).
- Adding canonical numeric variants (power_rating_va) gives stable anchors.
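A minimal sketch of building the spec view from normalized attributes (the fixed key list is illustrative; keep one per category):

```python
def build_spec_view(product: dict) -> str:
    """Serialize normalized attributes into line-separated 'key: value' text."""
    # A fixed, high-signal key order keeps the view consistent across products.
    spec_keys = [
        "product_type", "power_rating", "power_rating_va",
        "fuel_type", "phase", "cooling_type", "application",
    ]
    return "\n".join(
        f"{key}: {product[key]}" for key in spec_keys if product.get(key) is not None
    )

# Produces the "spec view" text shown above, ready to embed.
```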
Add controlled redundancy for query variance
Users type messy variants: 12kva, 12 kva, 12 KVA, 3-phase, 3 phase.
Add a small number of variants (don’t spam):
power_rating: 12 kVA (raw: 12 kva)
phase: 3 (aka: 3-phase, three phase)
Keep it consistent across products.
Step 3 — Don’t embed “attributes” as only one blob; use two views
A single attribute blob forces one vector to represent both “hard specs” and “soft semantics.” In B2B, those behave differently.
View A: Spec view (structured, compact)
- The line-separated key: value text above
- Goal: capture spec tokens and field context
View B: Intent view (short, template-like natural language)
Not a long paragraph. Keep it short:
Industrial standby diesel generator for backup power. 12 kVA, 3-phase, air-cooled.
Goal: improve synonyms and intent matching without drowning out spec tokens.
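A minimal templating sketch for the intent view (the template text is illustrative and would be defined per category):

```python
def build_intent_view(product: dict) -> str:
    """Render a short, template-like summary; compact enough that spec tokens are not drowned out."""
    template = "{application} {product_type}. {power_rating}, {phase}-phase, {cooling_type}."
    return template.format(**product)

# build_intent_view({"application": "industrial backup", "product_type": "diesel generator",
#                    "power_rating": "12 kVA", "phase": 3, "cooling_type": "air cooled"})
# -> "industrial backup diesel generator. 12 kVA, 3-phase, air cooled."
```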
Why not only passage-style (#4)?
Passage-style can help “application/intent,” but it often introduces filler and reduces the density of spec tokens. In your query regime, that can hurt.
The balanced approach is:
- spec view for constraints
- intent view for semantics
Step 4 — Combined vs per-attribute embeddings (what I would do)
Default: embed combined views, not every attribute separately
- spec_vec from the spec view
- intent_vec from the intent view
- (and title_vec for retrieval)
This keeps infra simple and gives strong signal.
Selectively add per-attribute or per-group embeddings (only where it helps)
Per-attribute embeddings make sense when:
- a field is long/semantic (application, compatible_materials, standards_notes)
- you want explicit weighting by field
They are usually not worth it for:
- small numeric fields (phase, axis_count, power_rating) because you should score those deterministically
Store multiple vectors per product if your DB supports it
Many vector DBs support multiple vectors per object (e.g., “named vectors”). Qdrant documents storing multiple named vector spaces per point. (qdrant.tech)
Milvus provides multi-vector hybrid search examples and then reranking strategies to merge results. (milvus.io)
Practical implication:
- store title_vec, spec_vec, intent_vec
- score them separately and combine in reranking
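A minimal sketch of storing the three views as named vectors in Qdrant, assuming the qdrant-client Python API (collection name, vector size, and the placeholder embeddings are assumptions):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")

# One named vector space per view; size depends on your embedding model.
client.create_collection(
    collection_name="products",
    vectors_config={
        "title_vec": VectorParams(size=768, distance=Distance.COSINE),
        "spec_vec": VectorParams(size=768, distance=Distance.COSINE),
        "intent_vec": VectorParams(size=768, distance=Distance.COSINE),
    },
)

# Placeholder embeddings; replace with the outputs of your embedding model.
title_emb, spec_emb, intent_emb = [0.0] * 768, [0.0] * 768, [0.0] * 768

client.upsert(
    collection_name="products",
    points=[
        PointStruct(
            id=1,
            vector={"title_vec": title_emb, "spec_vec": spec_emb, "intent_vec": intent_emb},
            payload={"sku": "AB-1234", "power_rating_va": 12000, "phase": 3},
        )
    ],
)

# Retrieve against one named vector space; combine per-view scores later in reranking.
hits = client.search(
    collection_name="products",
    query_vector=("spec_vec", [0.0] * 768),  # (vector name, query embedding)
    limit=50,
)
```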
Step 5 — Candidate retrieval should be hybrid (dense + lexical), then fused
Even if your initial plan is “title embeddings for candidates,” B2B search benefits heavily from hybrid retrieval:
- Dense vectors: semantic category matching
- Lexical/BM25: units, codes, exact tokens (kVA, hp, M12x1.75)
Weaviate explains hybrid search as running keyword (BM25) and vector search in parallel and then fusing results with algorithms like Reciprocal Rank Fusion (RRF). (Weaviate)
Why it matters for you:
- A query like 12 kva diesel generator has “hard anchors” (12, kva, diesel).
- If dense retrieval alone underweights any anchor, you can miss the best candidates entirely.
- Hybrid protects recall.
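A minimal Reciprocal Rank Fusion sketch for merging the BM25 and dense candidate lists (k=60 is the commonly used RRF constant):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked candidate ID lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# bm25_ids  = ["p7", "p3", "p9"]
# dense_ids = ["p3", "p1", "p7"]
# rrf_fuse([bm25_ids, dense_ids]) -> candidates found by both retrievers rise to the top.
```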
Step 6 — Attribute reranking: treat numbers/units/codes as explicit features
Reranking is where your “attribute embeddings” should pay off, but not as a single cosine score.
Reranking signals I would compute for each candidate (top-K)
A) Deterministic constraint features (high weight)
From query parsing + catalog normalization:
- Numeric match score:
  - exact match (after unit conversion)
  - within tolerance (e.g., ±5% or a domain-specific margin)
  - bucket match (10–15 kVA)
- Unit compatibility:
  - same unit / convertible / mismatch
- Categorical matches:
  - fuel_type, phase, axis_count, voltage class, etc.
- Code match:
  - exact/prefix/normalized match for MPN/SKU/standard codes
These features often dominate business relevance in B2B.
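A minimal sketch of the numeric-match feature, assuming the query constraint and catalog value are already converted to the same base unit (the thresholds are illustrative):

```python
def numeric_match(query_value: float, doc_value: float, tolerance: float = 0.05) -> float:
    """Score a numeric constraint: exact > within tolerance > loose bucket > miss."""
    if doc_value == query_value:
        return 1.0
    relative_error = abs(doc_value - query_value) / query_value
    if relative_error <= tolerance:
        return 0.8
    if relative_error <= 0.25:  # loose "bucket" band, e.g. 10–15 kVA for a 12 kVA query
        return 0.4
    return 0.0

# Query "12 kva" -> 12000 VA: a 12.5 kVA unit scores 0.8, a 20 kVA unit scores 0.0.
```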
B) Embedding similarity features (medium weight)
- sim(query, spec_vec)
- sim(query, intent_vec)
- optionally sim(query, title_vec)
Treat these as soft evidence.
C) Cross-encoder reranker score (often the biggest lift)
A cross-encoder reranker reads query + candidate together and outputs a relevance score directly.
- Cohere’s reranking docs explicitly note support for semi-structured data (JSON) and the ability to set “rank fields” so the model focuses on specific fields. (docs.cohere.com)
- The open bge-reranker-v2-m3 model card describes reranking as directly scoring (query, document) rather than embedding both separately. (Hugging Face)
Why this helps with your attributes:
- The model sees phase: 3 and the query token 3 phase in the same context.
- It can learn that 3 is the phase here, not the axis count.
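A minimal reranking sketch with the open bge-reranker-v2-m3 model via FlagEmbedding; the FlagReranker usage follows its model card, but treat the exact arguments as an assumption to verify:

```python
from FlagEmbedding import FlagReranker

# Cross-encoder: reads query + candidate together and outputs a relevance score.
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

query = "12 kva diesel generator 3 phase"
candidates = [
    "product_type: diesel generator\npower_rating: 12 kVA\nphase: 3\ncooling_type: air cooled",
    "product_type: diesel generator\npower_rating: 30 kVA\nphase: 3\ncooling_type: water cooled",
]

scores = reranker.compute_score([[query, doc] for doc in candidates])
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
```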
A concrete scoring shape
For each candidate document (d) and query (q):
score = w_rerank * reranker(q, d)
      + w_num * numeric_match(q, d)
      + w_cat * categorical_match(q, d)
      + w_code * code_match(q, d)
      + w_vec * (sim_spec + sim_intent)
      + w_lex * bm25(q, d) (optional in reranking)
Start with hand-tuned weights, then learn them (LTR) once you have clicks/orders.
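A minimal combination sketch with hand-tuned starter weights (all weights are illustrative, and each signal is assumed to be normalized to roughly [0, 1] beforehand):

```python
def combined_score(signals: dict[str, float], weights: dict[str, float] | None = None) -> float:
    """Blend reranker, deterministic constraint, vector, and lexical signals into one score."""
    w = weights or {
        "rerank": 0.45, "num": 0.20, "cat": 0.15,
        "code": 0.10, "vec": 0.07, "lex": 0.03,
    }
    return (
        w["rerank"] * signals["reranker"]
        + w["num"] * signals["numeric_match"]
        + w["cat"] * signals["categorical_match"]
        + w["code"] * signals["code_match"]
        + w["vec"] * (signals["sim_spec"] + signals["sim_intent"])
        + w["lex"] * signals.get("bm25", 0.0)  # optional in reranking
    )
```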
Step 7 — Answering your specific questions directly
1) Single combined text vs individual attribute embeddings
Recommended default
- Single spec view embedding + single intent view embedding
Selective per-attribute embeddings
- Only for long semantic fields where it improves recall/precision and where field weighting matters.
2) Does preserving keys help?
Yes, especially for numbers and ambiguous short values. Keys create context that disambiguates 3 and 12. Field-aware reranking approaches also assume multi-field structure. (docs.cohere.com)
3) Are separators/formatting important?
Yes, but simple is best:
- key: value with newlines is robust and debuggable.
- | separators are fine; they usually don’t outperform newlines if keys are present.
4) Best practices for numeric values/units/alphanumerics
- Parse and normalize into canonical numeric forms (for matching/filtering)
- Keep raw strings alongside canonical forms (for audit + lexical anchoring)
- Add a small set of common aliases/variants (not too many)
- Treat codes as lexical-first (exact/prefix/n-gram), and use embeddings as secondary
5) Passage-style vs structured key-value
- Passage-style helps soft semantics (application, intent)
- Structured KV helps constraint grounding
Use both views; don’t force one representation to do both jobs.
Step 8 — Model choice: what I would evaluate for your pipeline
You mentioned Marqo ecommerce embedding (large); it is explicitly positioned as an ecommerce embedding model on Hugging Face. (Hugging Face)
For your case, I would evaluate a small, controlled shortlist:
Embeddings
- Marqo/marqo-ecommerce-embeddings-L (commerce-tuned baseline) (Hugging Face)
- BAAI/bge-m3 (popular general retrieval baseline; good for long text and multi-granularity setups) (Hugging Face)
- Qwen/Qwen3-Embedding-4B (embedding + ranking family; useful if you want paired embed + rerank within one ecosystem) (Hugging Face)
Reranking
- BAAI/bge-reranker-v2-m3 (open reranker; query+doc → score) (Hugging Face)
- If using Cohere rerank, exploit rank fields to prioritize specific keys/fields in your semi-structured document. (docs.cohere.com)
The model choice should ultimately be driven by your own “spec-heavy” evaluation set (next section).
Step 9 — Evaluation that matches B2B reality (what I would measure)
Generic IR metrics can hide spec failures. You want at least one metric that measures constraint satisfaction.
Build an internal benchmark (must-have)
Create a labeled set stratified by query type:
- numeric + unit (12 kVA, 5 hp, 200 psi)
- count constraints (3 axis, 3 phase, 2 pole)
- codes (AB-1234, M12x1.75)
- pure semantic queries (no numbers)
Track:
- Recall@K for candidate retrieval
- nDCG@10 / MRR for reranking quality
- Constraint satisfaction rate: top-1 satisfies extracted constraints (your business KPI)
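A minimal sketch of the constraint satisfaction metric, assuming each labeled query carries its extracted constraints and that results expose normalized attributes (names here are illustrative):

```python
def constraint_satisfaction_rate(labeled_queries: list[dict], search_fn) -> float:
    """Fraction of queries whose top-1 result satisfies every extracted constraint.

    Each labeled query is assumed to look like:
    {"text": "12 kva diesel generator",
     "constraints": {"power_rating_va": 12000, "fuel_type": "diesel"}}
    """
    satisfied = 0
    for query in labeled_queries:
        top1 = search_fn(query["text"])[0]  # top-ranked product with normalized attributes
        # Swap == for a tolerance check on numeric fields if exact equality is too strict.
        if all(top1.get(key) == value for key, value in query["constraints"].items()):
            satisfied += 1
    return satisfied / len(labeled_queries) if labeled_queries else 0.0
```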
Do ablations (so you know what helped)
Run these variants:
- title_vec only
- title_vec + spec_vec
- title_vec + spec_vec + intent_vec
- add deterministic numeric/categorical/code features
- add reranker (cross-encoder)
This isolates whether embeddings are helping, and where.
Step 10 — Common pitfalls and how to avoid them
1) Over-reliance on cosine similarity for specs
This is the biggest cause of “looks relevant but wrong kVA/HP/axis” results. Use deterministic features for constraints and rerankers for context.
2) Too many attributes in the embedded text
Dumping 150 attributes reduces signal density. Prefer:
- a fixed “high-signal” attribute set per category
- plus a few category-specific keys
3) Multi-vector infra friction
Many frameworks assume one vector per record. A recent LlamaIndex issue shows practical friction when trying to use multiple dense vector fields in Milvus-backed stores, with workarounds like pre-creating schemas. (GitHub)
Plan for this early: choose a store/framework path that supports multi-vector cleanly or isolate it in your application layer.
4) Losing raw tokens during normalization
If you normalize away user-typed variants, you can hurt lexical/hybrid matching. Keep raw forms.
5) Logging and debuggability gaps
For every query, log:
- parsed constraints
- matched constraints per result
- spec view text used
- intent view text used
- per-signal scores (numeric, lexical, vector, reranker)
This turns relevance tuning into an engineering loop.
A practical “do this first” implementation plan
1. Normalize attributes (keys, numeric units, categorical vocab, code variants).
2. Create two texts per product:
   - spec view: newline key: value + canonical numeric fields
   - intent view: short template summary
3. Embed title, spec, intent. Store as separate vectors if supported (named vectors / multi-vector). (qdrant.tech)
4. Candidate retrieval: hybrid (BM25 + dense) fused via RRF. (Weaviate)
5. Rerank top-K using:
   - deterministic constraint features (numeric/unit/categorical/code)
   - embedding similarities
   - a cross-encoder reranker (field-aware if possible) (docs.cohere.com)
6. Evaluate with a spec-heavy benchmark and iterate via ablations.
Recommendation on your 6 strategies (final)
If you want a single answer:
- Use (3) line-separated key/value as the primary attribute embedding input (spec view).
- Add a second short intent view (controlled natural language) for semantics.
- Use (5) per-attribute embeddings only for a small set of semantic fields if needed.
- Do not rely on (4) long passages alone for spec-heavy queries.
- Back embeddings with deterministic numeric/unit/code matching and (ideally) a cross-encoder reranker for final ordering. (docs.cohere.com)