How it's used
It runs underneath every Landbase workflow: title expansion before list builds, concept filters where keyword search misses synonyms, and audience scoping when the customer can't enumerate every label that should count.
Why it matters
Most "filter UI" queries are quietly a semantic-search call underneath. Without it, every workflow narrows to the literal strings the customer remembered to type — and silently drops the variants they didn't.
The Similar Company Graph
Company Data (21.6M companies) → Embedding (1,024-dim vectors) → Indexing (Random Partition Forest) → Retrieval (ANN + neighbor exploration) → Reranking (with external signals) → Similar Company Graph (21.5B edges)
A naive pairwise comparison of all 21.6M companies would take about 233 trillion operations, one per unordered pair. The optimized pipeline does it in 361 billion, about 0.15% of brute force.
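The arithmetic behind those two figures can be checked in a few lines. This is a minimal sketch that assumes "pairwise" means one comparison per unordered pair, n(n-1)/2; the variable names are illustrative, and the only inputs are the company count and the 361 billion figure quoted above.

```python
# Figures quoted in the text; everything else is derived from them.
n_companies = 21_600_000
optimized_ops = 361_000_000_000

# Assumption: one comparison per unordered pair, n * (n - 1) / 2.
brute_force_pairs = n_companies * (n_companies - 1) // 2

# Share of the brute-force work that the optimized pipeline performs.
fraction = optimized_ops / brute_force_pairs

print(f"brute-force pairs: {brute_force_pairs:.3e}")
print(f"optimized share:   {fraction:.2%}")
```

The pair count comes out at roughly 2.33 x 10^14, which matches the 233 trillion quoted above.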
Stage by stage
- Embedding. Every company is turned into a 1,024-dimensional vector that captures its meaning — description, services, signals, the whole picture compressed into a numeric fingerprint.
- Random Partition Forest. A spatial index over the vectors. Lets us look up "close" without comparing every pair.
- Approximate nearest neighbors + neighbor exploration. Two retrieval stages stacked. ANN gets the rough candidate set; a follow-up exploration step adds 140B candidates to fill in what ANN missed.
- Reranking with external data. Top 1K neighbors get re-scored using signals the embedding alone doesn't see. Final recall lands at 100% at k=1, 99.7% at k=10, 99.5% at k=100.
- Similar Company Graph. The output of the pipeline: a graph with 21.5 billion edges connecting every company to its closest peers. Because the neighbors are precomputed, a lookup is a read from the graph rather than a fresh search.
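The stages above can be sketched end to end on toy data. This is an illustration under stated assumptions, not Landbase's implementation: the page does not describe how the Random Partition Forest is built, so the sketch stands in random-hyperplane trees with median splits; cosine similarity, the `rerank` callback, and every function name here are hypothetical.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def project(direction, v):
    return sum(d * x for d, x in zip(direction, v))

def build_tree(ids, vecs, rng, leaf_size=8):
    """Split ids at the median of a random projection (assumed stand-in for one tree)."""
    if len(ids) <= leaf_size:
        return ids  # leaf: a plain list of ids
    direction = [rng.gauss(0, 1) for _ in range(len(vecs[ids[0]]))]
    proj = {i: project(direction, vecs[i]) for i in ids}
    threshold = sorted(proj.values())[len(ids) // 2]
    left = [i for i in ids if proj[i] < threshold]
    right = [i for i in ids if proj[i] >= threshold]
    if not left or not right:
        return ids  # degenerate split: keep as a leaf
    return (direction, threshold,
            build_tree(left, vecs, rng, leaf_size),
            build_tree(right, vecs, rng, leaf_size))

def leaf_for(tree, v):
    """Walk from the root to the leaf whose partition contains v."""
    while isinstance(tree, tuple):
        direction, threshold, left, right = tree
        tree = left if project(direction, v) < threshold else right
    return tree

def top_k(i, candidates, vecs, k):
    others = candidates - {i}
    return sorted(others, key=lambda j: cosine(vecs[i], vecs[j]), reverse=True)[:k]

def similar_graph(vecs, k=5, n_trees=4, seed=0, rerank=None):
    rng = random.Random(seed)
    ids = list(range(len(vecs)))
    forest = [build_tree(ids, vecs, rng) for _ in range(n_trees)]

    # Retrieval, stage 1 (ANN): candidates are leaf-mates across all trees.
    ann = {}
    for i in ids:
        cands = set()
        for tree in forest:
            cands.update(leaf_for(tree, vecs[i]))
        ann[i] = top_k(i, cands, vecs, k)

    # Retrieval, stage 2 (neighbor exploration): add neighbors of neighbors.
    graph = {}
    for i in ids:
        cands = set(ann[i])
        for j in ann[i]:
            cands.update(ann[j])
        graph[i] = top_k(i, cands, vecs, k)

    # Reranking: re-order each neighbor list with an external score, if given.
    if rerank is not None:
        graph = {i: sorted(nbrs, key=lambda j: rerank(i, j), reverse=True)
                 for i, nbrs in graph.items()}
    return graph

# Toy data: 200 random 16-dim vectors standing in for company embeddings.
data_rng = random.Random(1)
vecs = [[data_rng.gauss(0, 1) for _ in range(16)] for _ in range(200)]
graph = similar_graph(vecs, k=5)
print(graph[0])
```

Each tree only compares a vector against its leaf-mates, which is how an index of this kind avoids the all-pairs scan. Using several trees and then exploring neighbors of neighbors recovers candidates that a single partition would separate.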
By the numbers
21.6M Companies indexed
1,024-dim Embedding vectors
21.5B Edges in the graph
99.7% Recall @ k=10
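For readers unfamiliar with the metric, recall at k is usually computed by comparing each approximate neighbor list against the exact (brute-force) one. The page does not say how its recall figures were measured, so the sketch below assumes that common definition; the function name and the toy neighbor lists are made up for illustration.

```python
def recall_at_k(approx, exact, k):
    """Mean fraction of each item's true top-k that appears in its approximate top-k."""
    total = 0.0
    for item, true_neighbors in exact.items():
        truth = set(true_neighbors[:k])
        found = set(approx.get(item, [])[:k])
        total += len(truth & found) / len(truth)
    return total / len(exact)

# Hypothetical neighbor lists, ordered from most to least similar.
exact = {"a": ["b", "c", "d"], "b": ["a", "c", "d"]}
approx = {"a": ["b", "d", "c"], "b": ["a", "d", "e"]}

for k in (1, 2, 3):
    print(k, recall_at_k(approx, exact, k))
```

Under this definition, a recall of 99.7% at k=10 would mean that, on average, 99.7% of each company's ten exact nearest neighbors appear among the ten that the pipeline returns.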