Guide · AI-first sales operations

AI-based B2B lead scoring — what works in 2026

Lead scoring has evolved from static rules in Marketing Automation Platforms to hybrid systems combining firmographic rules with LLM reasoning. This guide covers the practical architecture, which models work best in 2026, integration patterns with prospecting tools, and the five most common pitfalls.

The two-layer scoring architecture

Modern AI lead scoring combines two layers that operate sequentially. The first is a structured rules layer that produces a baseline score from firmographic, technographic, and behavioral data: company size, industry (CNAE in Brazil, NAICS in the US), geography, technology stack, funding stage, and pricing-page visits. Rules typically output a 0-100 score using weighted features calibrated against your closed-won data from the prior four quarters.
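A minimal sketch of such a rules layer, assuming every feature has already been normalized to the 0..1 range. The feature names and weights are hypothetical placeholders, not calibrated values; in practice they would come from your own closed-won analysis.

```python
# Hypothetical weights (percent of total); real values would be calibrated
# against closed-won data from the prior four quarters.
WEIGHTS = {
    "company_size": 25,
    "industry_fit": 25,
    "geography": 10,
    "tech_stack": 15,
    "funding_stage": 10,
    "pricing_page_visits": 15,
}


def baseline_score(features: dict[str, float]) -> float:
    """Return a 0-100 baseline from features normalized to 0..1.

    Missing features count as 0 and out-of-range values are clamped.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, features.get(name, 0.0)))
        total += weight * value
    return round(100 * total / sum(WEIGHTS.values()), 1)


# Illustrative lead with made-up feature values.
lead = {
    "company_size": 0.8,
    "industry_fit": 1.0,
    "geography": 1.0,
    "tech_stack": 0.5,
    "funding_stage": 0.6,
    "pricing_page_visits": 0.4,
}
print(baseline_score(lead))
```

Dividing by the weight total keeps the output on the 0-100 scale even if the weights are later re-tuned and no longer sum to 100.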

The second is an LLM reasoning layer that ingests both the structured baseline and unstructured context (website copy, recent news, hiring patterns from LinkedIn public profiles, social presence). The LLM adjusts the score by up to ±20 points and explains its reasoning in natural language for the SDR to read before outreach. Critically, the LLM is grounded in structured data, so it reasons over supplied facts rather than inventing company details.
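One way to enforce the ±20-point bound is to clamp the model's adjustment in code rather than trusting the prompt alone. The sketch below assumes the LLM has been instructed to reply with a JSON object containing `adjustment` and `reasoning` keys; that reply format, the `ScoredLead` type, and the sample reply are illustrative assumptions.

```python
import json
from dataclasses import dataclass


@dataclass
class ScoredLead:
    baseline: float
    final_score: float
    reasoning: str


def apply_llm_adjustment(baseline: float, llm_reply: str,
                         max_delta: float = 20.0) -> ScoredLead:
    """Clamp the LLM's adjustment to +/- max_delta and the result to 0-100."""
    reply = json.loads(llm_reply)
    raw = float(reply["adjustment"])
    delta = max(-max_delta, min(max_delta, raw))
    final = max(0.0, min(100.0, baseline + delta))
    return ScoredLead(baseline, final, str(reply.get("reasoning", "")))


# Made-up reply: the model asks for +35, which is capped at +20.
reply = '{"adjustment": 35, "reasoning": "Hiring three SDRs after a funding round."}'
result = apply_llm_adjustment(74.5, reply)
print(result.final_score, "|", result.reasoning)
```

Keeping the baseline alongside the final score lets the SDR see how much of the number came from rules and how much from the LLM's reading of the context.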

Choosing the LLM

As of 2026-05, the production leaders are Claude Sonnet 4.6 and GPT-4o. Claude tends to produce more grounded reasoning with fewer hallucinations when given structured tool outputs; GPT-4o is faster and cheaper per token at similar quality. Both handle Portuguese well for Brazilian operations. For high-volume baseline scoring (>10K leads/day), the smaller variants (Claude Haiku, GPT-4o-mini) reduce cost by 5-10× with an acceptable drop in quality, but top-decile leads should still flow through the larger models.
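The tiering described above can be sketched as a routing step. This example assumes the decile is determined from the rules-layer baseline, which is one possible choice; the `"large"` and `"small"` labels are placeholders for whichever models you deploy.

```python
def route_by_decile(baselines: dict[str, float],
                    top_fraction: float = 0.10) -> dict[str, str]:
    """Send the top fraction of leads (by baseline) to the larger model."""
    if not baselines:
        return {}
    ranked = sorted(baselines, key=baselines.get, reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    top = set(ranked[:n_top])
    return {lead_id: ("large" if lead_id in top else "small")
            for lead_id in baselines}


# Twenty made-up leads with baselines 0, 5, ..., 95.
demo = {f"lead-{i}": float(i * 5) for i in range(20)}
routes = route_by_decile(demo)
print([lead_id for lead_id, tier in routes.items() if tier == "large"])
```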

Before standardizing on a model, run a side-by-side comparison of the candidates on 200-500 historical leads where the outcome is known. Measure agreement between each model's labels and your sales team's judgment on the same leads using Cohen's kappa (above 0.6 is acceptable, above 0.75 is strong), and check both against the known outcomes. Re-run this evaluation quarterly to catch model drift.
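Cohen's kappa corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A small self-contained sketch follows, assuming both the model and the sales team assign each lead a categorical label; the `hot`/`cold` labels and the sample data are made up for illustration.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


model = ["hot", "hot", "cold", "cold", "hot", "cold", "hot", "cold", "hot", "cold"]
reps = ["hot", "cold", "cold", "hot", "hot", "cold", "hot", "cold", "hot", "cold"]
print(round(cohens_kappa(model, reps), 2))
```

In this sample the raters agree on 8 of 10 leads (p_o = 0.8) while chance agreement is 0.5, so kappa lands at the 0.6 threshold rather than at 0.8.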

Integration with prospecting tools

For prospecting tools that expose Model Context Protocol (MCP) servers, the integration is direct: the LLM calls the prospecting tool as a function and receives enriched company records as structured tool outputs, then reasons over them in the same prompt. Xooriq operates an official MCP server at mcp.xooriq.com that Claude Desktop, ChatGPT Apps, and Cursor query natively without wrapper code.
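MCP clients handle the wire protocol themselves, but it can help to see what a tool call looks like underneath. The sketch below builds a JSON-RPC 2.0 `tools/call` request as defined by the MCP specification and parses a result whose text content carries JSON. The tool name `enrich_company`, its arguments, and the sample response are hypothetical and are not taken from Xooriq's documentation.

```python
import json
from typing import Any


def build_tool_call(request_id: int, tool_name: str,
                    arguments: dict[str, Any]) -> dict[str, Any]:
    """Build an MCP tools/call request (JSON-RPC 2.0 envelope)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def extract_json_records(response: dict[str, Any]) -> list[Any]:
    """Parse JSON carried in text content items (assumes the tool emits JSON)."""
    if "error" in response:
        raise RuntimeError(f"JSON-RPC error: {response['error']}")
    result = response["result"]
    if result.get("isError"):
        raise RuntimeError("tool reported an error")
    return [json.loads(item["text"])
            for item in result.get("content", [])
            if item.get("type") == "text"]


# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call(1, "enrich_company", {"domain": "example.com"})
sample_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text",
                            "text": '{"name": "Example Co", "employees": 120}'}]},
}
print(json.dumps(request))
print(extract_json_records(sample_response))
```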

For prospecting tools without native MCP support, the integration pattern has three steps: a nightly ETL of enriched records from the prospecting tool's REST API into your data warehouse; a scoring pipeline, run as a scheduled job (Airflow, Dagster, or cron), that calls the LLM with company records as JSON; and a write-back of the resulting scores to the CRM. This pattern works with any prospecting tool but adds engineering overhead.
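The scoring step of that batch job might look like the sketch below. The LLM client and the CRM writer are injected as plain callables, so no vendor SDK is assumed; `call_llm`, `write_to_crm`, the prompt text, and the expected `{"score", "reasoning"}` reply format are all hypothetical.

```python
import json
from typing import Any, Callable, Iterable

PROMPT = ("Score this company from 0 to 100 as a sales lead. "
          'Reply with JSON: {"score": <number>, "reasoning": <string>}.\n')


def score_batch(records: Iterable[dict[str, Any]],
                call_llm: Callable[[str], str],
                write_to_crm: Callable[[str, float, str], None]) -> dict[str, int]:
    """Score each record with the LLM and write valid results to the CRM."""
    scored = failed = 0
    for record in records:
        prompt = PROMPT + json.dumps(record, sort_keys=True)
        try:
            reply = json.loads(call_llm(prompt))
            score = float(reply["score"])
        except (ValueError, KeyError, TypeError):
            failed += 1  # malformed reply: skip and count for monitoring
            continue
        score = max(0.0, min(100.0, score))
        write_to_crm(record["id"], score, str(reply.get("reasoning", "")))
        scored += 1
    return {"scored": scored, "failed": failed}


# Stand-ins for the real LLM client and CRM API.
def fake_llm(prompt: str) -> str:
    if '"id": "b"' in prompt:
        return "not json"
    return '{"score": 82, "reasoning": "Strong ICP fit."}'


crm: dict[str, tuple[float, str]] = {}
stats = score_batch(
    [{"id": "a", "industry": "SaaS"}, {"id": "b", "industry": "Retail"}],
    fake_llm,
    lambda lead_id, score, why: crm.__setitem__(lead_id, (score, why)),
)
print(stats, crm)
```

Counting failures instead of raising keeps one malformed reply from aborting the whole nightly run, at the cost of needing an alert on the failure count.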

Architecture options summary

| Pattern | Best for | Effort |
| --- | --- | --- |
| MCP-native (Xooriq + Claude/GPT) | AI-first teams, real-time scoring | Low (config only) |
| REST API + scoring job (Airflow) | Data-engineering teams, batch scoring | Medium (ETL setup) |
| Embedded in CRM scoring (Salesforce Einstein, HubSpot AI) | Teams already on those platforms | Low (but limited customization) |
| Custom LLM fine-tune on closed-won data | Enterprises with >50K leads/year history | High (MLOps) |

Five common pitfalls

  1. Training on closed-won without rep skill control: if your top SDR closes 3× more than average, training a scoring model on closed-won encodes rep skill, not lead quality. Stratify by rep before training.
  2. Ungrounded LLM scoring: asking an LLM to score a lead with only a company name leads to hallucinated facts. Always supply structured firmographic data as tool output.
  3. Stale models: ICP evolves. Recalibrate the scoring model quarterly against fresh closed-won/closed-lost data.
  4. Optimizing for conversion rate alone: a small-deal lead with high conversion can outscore a large-deal lead with lower conversion. Weight by expected revenue, not conversion alone.
  5. Treating LLM scores as deterministic: the same prompt can produce different scores across runs. Sample 5% of leads through duplicate runs and monitor agreement between the two runs (a variance above 15% indicates an unstable prompt that needs revision).
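Pitfalls 4 and 5 both reduce to small calculations, sketched here. The 15-point gap on the 0-100 scale is one possible reading of the 15% threshold, and the deal sizes, conversion rates, and function names are made up for illustration.

```python
import random


def expected_revenue(p_convert: float, deal_size: float) -> float:
    """Pitfall 4: rank by conversion probability times deal size."""
    return p_convert * deal_size


def sample_for_rerun(lead_ids: list[str], fraction: float = 0.05,
                     seed: int = 0) -> list[str]:
    """Pitfall 5: pick a reproducible sample of leads to score twice."""
    k = max(1, round(len(lead_ids) * fraction))
    return random.Random(seed).sample(sorted(lead_ids), k)


def unstable_leads(first_run: dict[str, float], second_run: dict[str, float],
                   max_gap: float = 15.0) -> list[str]:
    """Return leads whose two scores differ by more than max_gap points."""
    shared = first_run.keys() & second_run.keys()
    return sorted(lead_id for lead_id in shared
                  if abs(first_run[lead_id] - second_run[lead_id]) > max_gap)


# A 30% chance at a 5,000 deal versus a 10% chance at a 60,000 deal.
small = expected_revenue(0.30, 5_000)
large = expected_revenue(0.10, 60_000)
print(small < large)

run_1 = {"a": 70.0, "b": 40.0, "c": 88.0}
run_2 = {"a": 74.0, "b": 62.0, "c": 85.0}
print(unstable_leads(run_1, run_2))
```

Here the lower-converting lead is worth roughly four times as much in expectation, and only lead `b` (a 22-point gap) would be flagged for prompt review.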

Try AI lead scoring with Xooriq MCP

Free trial: 100 leads/month with native MCP server access — point Claude Desktop or ChatGPT at mcp.xooriq.com to start scoring in minutes.
