Best way to generate embeddings for structured product attributes in B2B ecommerce search

I am building a B2B product search system using vector embeddings and would like advice specifically on how to generate embeddings for structured product attributes.

Context

  • Domain: B2B ecommerce

  • Queries: Short keyword-style searches (4 to 5 tokens), often containing numbers, units, and alphanumeric attributes
    Examples:

    • “12 kva diesel generator”

    • “5 hp air compressor”

    • “cnc milling machine 3 axis”

Search architecture

  • Initial candidate retrieval using product title embeddings

  • Reranking using product attribute embeddings

Product data

Each product has a title and a set of structured attributes stored as key-value pairs.

Example:

Product: Diesel Generator

Attributes:

  • “power_rating: 12 kva”

  • “fuel_type: diesel”

  • “phase: 3”

  • “cooling_type: air cooled”

  • “application: industrial backup”

Main question

What is the best way to preprocess and embed these attributes for semantic reranking?

Attribute embedding strategies we are considering

  1. Flat concatenation

    power rating 12 kva fuel type diesel phase 3 cooling type air cooled application industrial backup
    
    
  2. Key-value with separators

    power_rating: 12 kva | fuel_type: diesel | phase: 3 | cooling_type: air cooled | application: industrial backup
    
    
  3. Line-separated attributes

    power_rating: 12 kva
    fuel_type: diesel
    phase: 3
    cooling_type: air cooled
    application: industrial backup
    
    
  4. Natural language passage

    This diesel generator has a power rating of 12 kva, uses diesel fuel, supports 3 phase operation, and is air cooled for industrial backup usage.
    
    
  5. Per-attribute embeddings

    • Generate one embedding per attribute and aggregate scores during reranking
  6. Any other recommended method?

Specific questions

  • Should attributes be embedded as a single combined text or as individual attribute embeddings

  • Does explicitly preserving attribute keys help embedding quality

  • Are separator tokens or structured formatting important for short, attribute-heavy queries

  • Any best practices for handling numeric values, units, and alphanumeric attributes

  • Does passage-style text perform better than structured key-value text for dense retrieval

Model considerations

  • Currently considering Marqo ecommerce embedding (large)

  • Open to recommendations for other models that work well for:

    • Short B2B queries

    • Numeric and unit-heavy matching

    • Attribute-based reranking


There is some established know-how for this.


What matters most in your case

Your queries are short (4–5 tokens) but dense with constraints:

  • Numbers + units: 12 kVA, 5 hp
  • Counts: 3 axis, 3 phase
  • Alphanumerics: model/SKU/standards (common in B2B)

Dense embeddings are excellent for semantic proximity (“generator” ↔ “genset”, “industrial backup” ↔ “standby”), but they are often unreliable at enforcing numeric precision and exact spec constraints in a cosine-similarity-only setup. Empirical work on embeddings and numeric detail shows this is a consistent failure mode. (Hugging Face)

For that reason, the best-performing B2B search stacks treat embeddings as one signal in reranking, not the judge of truth for numbers/units/codes.


Target behavior: “spec correctness” first, semantics second

For B2B, success is typically:

  1. Retrieve the right product type (generator, compressor, milling machine)
  2. Ensure the top results satisfy the hard constraints (kVA/HP/axis/phase)
  3. Use semantics to break ties (application, cooling type, brand preference, etc.)

This informs how you should preprocess attributes and how you should score candidates.


Step 1 — Normalize the catalog attributes before embedding

Attribute embeddings become dramatically more stable once the underlying attributes are consistent.

1) Canonicalize attribute keys (schema consolidation)

Make a canonical key set and map synonyms into it:

  • power, power_rating, rated_power → power_rating
  • phases, phase_count → phase
  • cooling, cooling_type → cooling_type

Store:

  • key_raw
  • key_canonical

This reduces fragmentation in both lexical and semantic matching.

2) Normalize numeric values + units (raw + canonical)

For each numeric attribute, store both representations:

  • value_raw_text: "12 kva"
  • value_number: 12
  • unit_raw: "kva"
  • unit_canonical: "kVA"
  • value_canonical_base: 12000
  • unit_base: "VA"

This enables:

  • exact/range matching
  • unit conversion (kVA ↔ VA, HP ↔ W, inch ↔ mm)
  • controlled tolerances (±5%, bucket ranges)
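
A minimal sketch of this normalization in Python; the unit table and field names mirror the list above and are illustrative, not a fixed schema:

```python
import re

# Hypothetical unit table: canonical spelling, base unit, factor to the base unit.
UNIT_TABLE = {
    "kva": ("kVA", "VA", 1000.0),
    "va":  ("VA",  "VA", 1.0),
    "hp":  ("HP",  "W",  745.7),
    "kw":  ("kW",  "W",  1000.0),
}

NUMERIC_RE = re.compile(r"(?P<num>\d+(?:\.\d+)?)\s*(?P<unit>[a-z]+)", re.IGNORECASE)

def normalize_numeric(value_raw_text: str) -> dict:
    """Parse '12 kva' into raw + canonical representations."""
    m = NUMERIC_RE.search(value_raw_text)
    if not m:
        return {"value_raw_text": value_raw_text}
    number = float(m.group("num"))
    unit_raw = m.group("unit").lower()
    unit_canonical, unit_base, factor = UNIT_TABLE.get(unit_raw, (unit_raw, unit_raw, 1.0))
    return {
        "value_raw_text": value_raw_text,
        "value_number": number,
        "unit_raw": unit_raw,
        "unit_canonical": unit_canonical,
        "value_canonical_base": number * factor,
        "unit_base": unit_base,
    }

print(normalize_numeric("12 kva"))
# {'value_raw_text': '12 kva', 'value_number': 12.0, 'unit_raw': 'kva',
#  'unit_canonical': 'kVA', 'value_canonical_base': 12000.0, 'unit_base': 'VA'}
```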

3) Normalize categorical attributes into controlled vocab

Example:

  • fuel_type: diesel / gasoline / natural_gas / LPG
  • cooling_type: air_cooled / water_cooled / oil_cooled

Store:

  • value_raw
  • value_canonical

4) Normalize codes (alphanumerics) into multiple searchable forms

For MPNs, SKUs, standards, thread sizes, etc.:

  • AB-1234 → AB1234, AB 1234
  • M12x1.75 → M12 x 1.75, M12x1.75

These fields usually need strong lexical treatment (exact/partial/regex/character n-grams). Dense embeddings should not be your only tool here.
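
A small sketch of generating those lexical variants (the separator and spacing rules are illustrative; tune them per code family):

```python
import re

def code_variants(code: str) -> set[str]:
    """Generate searchable forms of an alphanumeric code, e.g. 'AB-1234' or 'M12x1.75'."""
    base = code.strip()
    variants = {base, base.upper(), base.lower()}
    # Collapse / expand separators: 'AB-1234' -> 'AB1234', 'AB 1234'
    variants.add(re.sub(r"[\s\-_]+", "", base))
    variants.add(re.sub(r"[\-_]+", " ", base))
    # Add spacing around 'x' in dimension-style codes: 'M12x1.75' -> 'M12 x 1.75'
    variants.add(re.sub(r"(?<=\d)x(?=\d)", " x ", base, flags=re.IGNORECASE))
    return {v for v in variants if v}

print(sorted(code_variants("AB-1234")))   # ['AB 1234', 'AB-1234', 'AB1234', 'ab-1234']
print(sorted(code_variants("M12x1.75")))  # ['M12 x 1.75', 'M12X1.75', 'M12x1.75', 'm12x1.75']
```

Index these variants into your lexical/BM25 fields; they are not meant to be embedded.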


Step 2 — Build the attribute text that you embed (the best “view” for spec queries)

You listed four text serialization options (1–4); per-attribute embeddings are covered in Step 4. For your query style, the best default is:

Recommended default: Line-separated key: value (option #3)

Example “spec view” for the generator:

product_type: diesel generator
power_rating: 12 kVA
power_rating_va: 12000 VA
fuel_type: diesel
phase: 3
cooling_type: air cooled
application: industrial backup

Why it works well:

  • Keys disambiguate values (especially numbers like 3).
  • Newlines preserve boundaries cleanly (avoids attribute “bleeding” that happens with flat concatenation).
  • Adding canonical numeric variants (power_rating_va) gives stable anchors.
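
A minimal builder for this spec view, assuming normalized attribute records as produced in Step 1 (the exact field names are illustrative):

```python
def build_spec_view(product_type: str, attributes: list[dict]) -> str:
    """Serialize normalized attributes as line-separated 'key: value' text."""
    lines = [f"product_type: {product_type}"]
    for attr in attributes:
        key = attr["key_canonical"]
        lines.append(f"{key}: {attr.get('value_canonical', attr['value_raw'])}")
        # Add a canonical base-unit variant for numeric attributes as a stable anchor.
        if "value_canonical_base" in attr:
            lines.append(f"{key}_{attr['unit_base'].lower()}: "
                         f"{attr['value_canonical_base']:g} {attr['unit_base']}")
    return "\n".join(lines)

attrs = [
    {"key_canonical": "power_rating", "value_raw": "12 kva", "value_canonical": "12 kVA",
     "value_canonical_base": 12000, "unit_base": "VA"},
    {"key_canonical": "fuel_type", "value_raw": "diesel"},
    {"key_canonical": "phase", "value_raw": "3"},
]
print(build_spec_view("diesel generator", attrs))
# product_type: diesel generator
# power_rating: 12 kVA
# power_rating_va: 12000 VA
# fuel_type: diesel
# phase: 3
```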

Add controlled redundancy for query variance

Users type messy variants: 12kva, 12 kva, 12 KVA, 3-phase, 3 phase.

Add a small number of variants (don’t spam):

  • power_rating: 12 kVA (raw: 12 kva)
  • phase: 3 (aka: 3-phase, three phase)

Keep it consistent across products.


Step 3 — Don’t embed “attributes” as only one blob; use two views

A single attribute blob forces one vector to represent both “hard specs” and “soft semantics.” In B2B, those behave differently.

View A: Spec view (structured, compact)

  • The line-separated key: value text above
  • Goal: capture spec tokens and field context

View B: Intent view (short, template-like natural language)

Not a long paragraph. Keep it short:

Industrial standby diesel generator for backup power. 12 kVA, 3-phase, air-cooled.

Goal: improve synonyms and intent matching without drowning out spec tokens.
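
A hypothetical template for this intent view; keep one template per category so every product reads the same way:

```python
def build_intent_view(p: dict, category_blurb: str = "") -> str:
    """One templated sentence of intent/synonyms, followed by the key specs."""
    head = f"{p['application'].capitalize()} {p['product_type']}"
    if category_blurb:
        head += f" {category_blurb}"
    return f"{head}. {p['power_rating']}, {p['phase']}-phase, {p['cooling_type'].replace(' ', '-')}."

print(build_intent_view(
    {"application": "industrial standby", "product_type": "diesel generator",
     "power_rating": "12 kVA", "phase": "3", "cooling_type": "air cooled"},
    category_blurb="for backup power",
))
# Industrial standby diesel generator for backup power. 12 kVA, 3-phase, air-cooled.
```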

Why not only passage-style (#4)?

Passage-style can help “application/intent,” but it often introduces filler and reduces the density of spec tokens. In your query regime, that can hurt.

The balanced approach is:

  • spec view for constraints
  • intent view for semantics

Step 4 — Combined vs per-attribute embeddings (what I would do)

Default: embed combined views, not every attribute separately

  • spec_vec from the spec view
  • intent_vec from the intent view
  • (and title_vec for retrieval)

This keeps infra simple and gives strong signal.

Selectively add per-attribute or per-group embeddings (only where it helps)

Per-attribute embeddings make sense when:

  • a field is long/semantic (application, compatible_materials, standards_notes)
  • you want explicit weighting by field

They are usually not worth it for:

  • small numeric fields (phase, axis_count, power_rating) because you should score those deterministically

Store multiple vectors per product if your DB supports it

Many vector DBs support multiple vectors per object (e.g., “named vectors”). Qdrant documents storing multiple named vector spaces per point. (qdrant.tech)
Milvus provides multi-vector hybrid search examples and then reranking strategies to merge results. (milvus.io)

Practical implication:

  • store title_vec, spec_vec, intent_vec
  • score them separately and combine in reranking
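
For example, with Qdrant's named vectors (a sketch using recent qdrant-client versions; the collection name, vector names, sizes, and payload fields are illustrative):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")

# One named vector space per view; sizes depend on your embedding model.
client.create_collection(
    collection_name="products",
    vectors_config={
        "title_vec":  VectorParams(size=768, distance=Distance.COSINE),
        "spec_vec":   VectorParams(size=768, distance=Distance.COSINE),
        "intent_vec": VectorParams(size=768, distance=Distance.COSINE),
    },
)

# Placeholder vectors; replace with real model outputs.
title_emb = spec_emb = intent_emb = [0.0] * 768
query_emb = [0.0] * 768

client.upsert(
    collection_name="products",
    points=[PointStruct(
        id=1,
        vector={"title_vec": title_emb, "spec_vec": spec_emb, "intent_vec": intent_emb},
        payload={"power_rating_va": 12000, "fuel_type": "diesel", "phase": 3},
    )],
)

# Score against one view at a time, then combine the similarities during reranking.
hits = client.query_points(
    collection_name="products",
    query=query_emb,
    using="spec_vec",
    limit=50,
).points
```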

Step 5 — Candidate retrieval should be hybrid (dense + lexical), then fused

Even if your initial plan is “title embeddings for candidates,” B2B search benefits heavily from hybrid retrieval:

  • Dense vectors: semantic category matching
  • Lexical/BM25: units, codes, exact tokens (kVA, hp, M12x1.75)

Weaviate explains hybrid search as running keyword (BM25) and vector search in parallel and then fusing results with algorithms like Reciprocal Rank Fusion (RRF). (Weaviate)

Why it matters for you:

  • A query like 12 kva diesel generator has “hard anchors” (12, kva, diesel).
  • If dense retrieval alone underweights any anchor, you can miss the best candidates entirely.
  • Hybrid protects recall.
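
A minimal RRF fusion sketch (ranked ID lists in, fused ranking out; k = 60 is the commonly used constant):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_ranking  = ["p7", "p2", "p9"]   # lexical hits: exact '12', 'kva', 'diesel' tokens
dense_ranking = ["p2", "p4", "p7"]   # semantic hits from title embeddings
print(rrf_fuse([bm25_ranking, dense_ranking]))
# p2 and p7 rise to the top because both retrievers agree on them
```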

Step 6 — Attribute reranking: treat numbers/units/codes as explicit features

Reranking is where your “attribute embeddings” should pay off, but not as a single cosine score.

Reranking signals I would compute for each candidate (top-K)

A) Deterministic constraint features (high weight)

From query parsing + catalog normalization:

  • Numeric match score:

    • exact match (after unit conversion)
    • within tolerance (e.g., ±5% or a domain-specific margin)
    • bucket match (10–15 kVA)
  • Unit compatibility:

    • same unit / convertible / mismatch
  • Categorical matches:

    • fuel_type, phase, axis_count, voltage class, etc.
  • Code match:

    • exact/prefix/normalized match for MPN/SKU/standard codes

These features often dominate business relevance in B2B.
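
A sketch of the numeric and categorical constraint features, assuming the canonical base-unit values from Step 1 (tolerance and the graded values are illustrative starting points):

```python
from typing import Optional

def numeric_match(query_value_base: float, product_value_base: float,
                  tolerance: float = 0.05) -> float:
    """Graded numeric feature: 1.0 exact, 0.7 within tolerance, 0.0 otherwise."""
    if product_value_base == query_value_base:
        return 1.0
    if abs(product_value_base - query_value_base) <= tolerance * query_value_base:
        return 0.7
    return 0.0

def categorical_match(query_value: Optional[str], product_value: Optional[str]) -> float:
    """1.0 if the canonical values agree, 0.0 if they conflict, 0.5 if either is unknown."""
    if query_value is None or product_value is None:
        return 0.5
    return 1.0 if query_value == product_value else 0.0

# Query '12 kva diesel generator' parsed to 12000 VA + fuel_type=diesel:
print(numeric_match(12000, 12000))            # 1.0 exact kVA match
print(numeric_match(12000, 12500))            # 0.7, within ±5%
print(numeric_match(12000, 15000))            # 0.0
print(categorical_match("diesel", "diesel"))  # 1.0
```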

B) Embedding similarity features (medium weight)

  • sim(query, spec_vec)
  • sim(query, intent_vec)
  • optionally sim(query, title_vec)

Treat these as soft evidence.

C) Cross-encoder reranker score (often the biggest lift)

A cross-encoder reranker reads query + candidate together and outputs a relevance score directly.

  • Cohere’s reranking docs explicitly note support for semi-structured data (JSON) and the ability to set “rank fields” so the model focuses on specific fields. (docs.cohere.com)
  • The open bge-reranker-v2-m3 model card describes reranking as directly scoring (query, document) rather than embedding both separately. (Hugging Face)

Why this helps with your attributes:

  • The model sees phase: 3 and the query token 3 phase in the same context.
  • It can learn that 3 is phase here, not axis count.
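
A usage sketch with the open bge-reranker-v2-m3 via the FlagEmbedding package (the candidate texts here are shortened spec views; this is one common way to run the model, not the only one):

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

query = "12 kva diesel generator"
candidates = [
    "product_type: diesel generator\npower_rating: 12 kVA\nfuel_type: diesel\nphase: 3",
    "product_type: diesel generator\npower_rating: 20 kVA\nfuel_type: diesel\nphase: 3",
]

# The cross-encoder reads query + candidate together and scores each pair directly.
scores = reranker.compute_score([[query, doc] for doc in candidates])
print(scores)  # higher = more relevant; feed into the weighted reranking score below
```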

A concrete scoring shape

For each candidate document (d) and query (q):

  • score =

    • w_rerank * reranker(q, d)
    • + w_num * numeric_match(q, d)
    • + w_cat * categorical_match(q, d)
    • + w_code * code_match(q, d)
    • + w_vec * (sim_spec + sim_intent)
    • + w_lex * bm25(q, d) (optional in reranking)

Start with hand-tuned weights, then learn them (LTR) once you have clicks/orders.
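
A hand-tuned version of that scoring shape (weights are illustrative placeholders, and each signal is assumed to be normalized to a comparable 0–1 scale first):

```python
WEIGHTS = {"rerank": 0.45, "num": 0.25, "cat": 0.15, "code": 0.10, "vec": 0.05}

def candidate_score(signals: dict) -> float:
    """Weighted sum of the reranking signals computed for one (query, candidate) pair."""
    return (WEIGHTS["rerank"] * signals["reranker"]
            + WEIGHTS["num"]  * signals["numeric_match"]
            + WEIGHTS["cat"]  * signals["categorical_match"]
            + WEIGHTS["code"] * signals["code_match"]
            + WEIGHTS["vec"]  * (signals["sim_spec"] + signals["sim_intent"]))

print(candidate_score({
    "reranker": 0.82, "numeric_match": 1.0, "categorical_match": 1.0,
    "code_match": 0.0, "sim_spec": 0.61, "sim_intent": 0.55,
}))
```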


Step 7 — Answering your specific questions directly

1) Single combined text vs individual attribute embeddings

Recommended default

  • Single spec view embedding + single intent view embedding

Selective per-attribute embeddings

  • Only for long semantic fields where it improves recall/precision and where field weighting matters.

2) Does preserving keys help?

Yes, especially for numbers and ambiguous short values. Keys create context that disambiguates 3 and 12. Field-aware reranking approaches also assume multi-field structure. (docs.cohere.com)

3) Are separators/formatting important?

Yes, but simple is best:

  • key: value with newlines is robust and debuggable.
  • | separators are fine; they usually don’t outperform newlines if keys are present.

4) Best practices for numeric values/units/alphanumerics

  • Parse and normalize into canonical numeric forms (for matching/filtering)
  • Keep raw strings alongside canonical forms (for audit + lexical anchoring)
  • Add a small set of common aliases/variants (not too many)
  • Treat codes as lexical-first (exact/prefix/n-gram), and use embeddings as secondary

5) Passage-style vs structured key-value

  • Passage-style helps soft semantics (application, intent)
  • Structured KV helps constraint grounding
    Use both views; don’t force one representation to do both jobs.

Step 8 — Model choice: what I would evaluate for your pipeline

You mentioned Marqo ecommerce embedding (large); it is explicitly positioned as an ecommerce embedding model on Hugging Face. (Hugging Face)

For your case, I would evaluate a small, controlled shortlist:

Embeddings

  • Marqo/marqo-ecommerce-embeddings-L (commerce-tuned baseline) (Hugging Face)
  • BAAI/bge-m3 (popular general retrieval baseline; good for long text and multi-granularity setups) (Hugging Face)
  • Qwen/Qwen3-Embedding-4B (embedding + ranking family; useful if you want paired embed + rerank within one ecosystem) (Hugging Face)

Reranking

  • BAAI/bge-reranker-v2-m3 (open reranker; query+doc → score) (Hugging Face)
  • If using Cohere rerank, exploit rank fields to prioritize specific keys/fields in your semi-structured document. (docs.cohere.com)

The model choice should ultimately be driven by your own “spec-heavy” evaluation set (next section).


Step 9 — Evaluation that matches B2B reality (what I would measure)

Generic IR metrics can hide spec failures. You want at least one metric that measures constraint satisfaction.

Build an internal benchmark (must-have)

Create a labeled set stratified by query type:

  1. numeric + unit (12 kVA, 5 hp, 200 psi)
  2. count constraints (3 axis, 3 phase, 2 pole)
  3. codes (AB-1234, M12x1.75)
  4. pure semantic queries (no numbers)

Track:

  • Recall@K for candidate retrieval
  • nDCG@10 / MRR for reranking quality
  • Constraint satisfaction rate: top-1 satisfies extracted constraints (your business KPI)
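
A small sketch of the two metrics that matter most here (Recall@K is available in any IR toolkit; the constraint satisfaction rate depends on your own query constraint extractor):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 50) -> float:
    """Fraction of relevant products that appear in the top-k retrieved candidates."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def constraint_satisfaction_rate(results_per_query: list[dict]) -> float:
    """Share of queries whose top-1 result satisfies all extracted hard constraints."""
    satisfied = sum(1 for r in results_per_query if r["top1_satisfies_constraints"])
    return satisfied / len(results_per_query)

print(recall_at_k(["p2", "p7", "p4"], {"p2", "p9"}, k=3))  # 0.5
print(constraint_satisfaction_rate([
    {"query": "12 kva diesel generator", "top1_satisfies_constraints": True},
    {"query": "5 hp air compressor",     "top1_satisfies_constraints": False},
]))  # 0.5
```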

Do ablations (so you know what helped)

Run these variants:

  1. title_vec only
  2. title_vec + spec_vec
  3. title_vec + spec_vec + intent_vec
  4. add deterministic numeric/categorical/code features
  5. add reranker (cross-encoder)

This isolates whether embeddings are helping, and where.


Step 10 — Common pitfalls and how to avoid them

1) Over-reliance on cosine similarity for specs

This is the biggest cause of “looks relevant but wrong kVA/HP/axis” results. Use deterministic features for constraints and rerankers for context.

2) Too many attributes in the embedded text

Dumping 150 attributes reduces signal density. Prefer:

  • a fixed “high-signal” attribute set per category
  • plus a few category-specific keys

3) Multi-vector infra friction

Many frameworks assume one vector per record. A recent LlamaIndex issue shows practical friction when trying to use multiple dense vector fields in Milvus-backed stores, with workarounds like pre-creating schemas. (GitHub)
Plan for this early: choose a store/framework path that supports multi-vector cleanly or isolate it in your application layer.

4) Losing raw tokens during normalization

If you normalize away user-typed variants, you can hurt lexical/hybrid matching. Keep raw forms.

5) Logging and debuggability gaps

For every query, log:

  • parsed constraints
  • matched constraints per result
  • spec view text used
  • intent view text used
  • per-signal scores (numeric, lexical, vector, reranker)

This turns relevance tuning into an engineering loop.
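
One way to structure that log record (field names are illustrative; ship it to your logging pipeline instead of stdout):

```python
import json
import time

def log_query_debug(query: str, parsed_constraints: dict, results: list[dict]) -> None:
    """Emit one structured record per query for relevance debugging."""
    record = {
        "ts": time.time(),
        "query": query,
        "parsed_constraints": parsed_constraints,       # output of the query parser
        "results": [
            {
                "product_id": r["id"],
                "matched_constraints": r["matched_constraints"],
                "scores": r["scores"],                  # numeric / lexical / vector / reranker
                "spec_view": r["spec_view"],
                "intent_view": r["intent_view"],
            }
            for r in results
        ],
    }
    print(json.dumps(record))
```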


A practical “do this first” implementation plan

  1. Normalize attributes (keys, numeric units, categorical vocab, code variants).

  2. Create two texts per product:

    • spec view: newline key: value + canonical numeric fields
    • intent view: short template summary
  3. Embed title, spec, intent. Store as separate vectors if supported (named vectors / multi-vector). (qdrant.tech)

  4. Candidate retrieval: hybrid (BM25 + dense) fused via RRF. (Weaviate)

  5. Rerank top-K using:

    • deterministic constraint features (numeric/unit/categorical/code)
    • embedding similarities
    • a cross-encoder reranker (field-aware if possible) (docs.cohere.com)
  6. Evaluate with a spec-heavy benchmark and iterate via ablations.


Recommendation on your 6 strategies (final)

If you want a single answer:

  • Use (3) line-separated key/value as the primary attribute embedding input (spec view).
  • Add a second short intent view (controlled natural language) for semantics.
  • Use (5) per-attribute embeddings only for a small set of semantic fields if needed.
  • Do not rely on (4) long passages alone for spec-heavy queries.
  • Back embeddings with deterministic numeric/unit/code matching and (ideally) a cross-encoder reranker for final ordering. (docs.cohere.com)