Search, Don't Browse: Agentic LLM-Powered Signal Discovery for Fraud Detection

By Travis Li, Jeffrey Zhou, Shuo Deng
Jun 26, 2026

TL;DR: When a new fraud pattern hits, figuring out which signals already exist in your data — and what new ones you should build — is a manual, time-consuming process. We built a search tool over our internal feature store and event logs that lets fraud analysts describe a new scam in plain English and quickly discover relevant signals.

Under the hood, it combines traditional keyword search, semantic embeddings, and an LLM that re-ranks verified candidates so it never hallucinates features that don't exist. The tool returns ranked feature cards and new feature ideas in under 30 seconds in typical cases, via both an interactive UI and a machine-readable spec that plugs into our model training and rule generation pipelines.

In practical terms, this is a search engine over internal features and events. A fraud analyst types a description of a new scam, "user is called by a fake support agent and convinced to drain their account," and gets back a ranked list of the most relevant existing features, computed signal ideas, and proposed new features to build from raw event streams. A task that, across the industry, typically takes hours of catalog browsing and pinging colleagues now takes under a minute.

The Problem: Features Are the Bottleneck

Fraud detection is, at its core, a feature engineering problem. Rule engines and ML models are only as good as the signals they're fed. When a new fraud pattern emerges (a social-engineering phone scam, a novel account-takeover vector, a new payment abuse flow) the first question is always: what signals already exist in our data that would distinguish this pattern from normal behavior?

Answering that question manually is slow and error-prone. Large-scale ML systems in financial services accumulate hundreds of feature views across dozens of teams, each exposing dozens of individual columns. Meanwhile, the raw event stream, the behavioral log that everything else derives from, contains thousands of distinct event types that may offer additional value as formalized features.

In any large financial-services organization, an analyst facing a new incident pattern could spend an entire day browsing the feature catalog, grepping through repository files, and asking colleagues which features are actually used in production. Shortening that cycle is a competitive advantage.

We wanted to automate the first step: given a plain-English description of a fraud pattern, return a ranked list of the most relevant existing features and flag the data sources that could power new features that don't yet exist. This tool is also the upstream foundation for the Feature Enrichment phase described in our companion post on automating risk model retraining with a generative AI tool. That phase is the step where a retraining pipeline rapidly identifies which additional signals could strengthen a model's active feature set.

How It Works: Architecture Overview

At a high level, the system has two offline phases that build a knowledge base, and one online phase that answers queries.

Offline — Feature indexing: We pull metadata for feature views from our internal feature store and use an LLM to enrich every column with human-readable descriptions and fraud-relevance notes. Each column is then embedded as a dense vector so that similar signals live near each other in vector space.

Offline — Event indexing: We maintain a human-curated registry of important events. For each event, we verify its schema against the live event stream and document what it represents. These event descriptions are also embedded and indexed.

Online — Query pipeline: When an analyst submits a description of a fraud pattern, we run keyword and semantic search over the feature and event indexes, fuse the results, and use an LLM to re-rank the most promising candidates. The output is a set of feature cards and forward-looking feature ideas, available both in the UI and as a JSON spec for downstream systems.

The rest of this post walks through each of these stages at a conceptual level.

Building the Knowledge Base

Enriching Feature Metadata

A common challenge in large-scale ML systems, particularly in the financial sector, is the lack of documentation explaining the thousands of data signals used by models. Feature names are often cryptic and machine-generated — things like aggregate_payment_volume_by_region_sitewide_24h — making them difficult for new analysts or platform engineers to search and understand.

To address this documentation debt and make these signals discoverable, we developed an AI-powered enrichment pipeline. For each feature view, we:

Pull metadata such as view name, owning team, data source, and feature service membership.
Locate the corresponding definition in the codebase.
Ask an LLM to write, for every column, a short description of what it measures and why it matters for fraud.

The result is a plain-English view of our feature store. For example:

We then embed each column as a vector, combining the column name, its description, its data source, and its fraud relevance note into a single text representation. Crucially, we embed at the column level, not just the view level. A view about send behavior might contain both highly relevant signals (recent transfer velocity) and less relevant ones (lifetime deposit volume). Column-level embeddings let the retrieval system find the right signal within the right view.
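To make the column-level text representation concrete, here is a minimal sketch in Python. The class, field names, and sample columns are all hypothetical, and the embedding model itself is left out; the sketch only shows how one text string per column might be assembled before embedding.

```python
from dataclasses import dataclass


@dataclass
class ColumnDoc:
    """One enriched feature-store column (all field names are hypothetical)."""
    view_name: str
    column_name: str
    description: str
    data_source: str
    fraud_relevance: str


def to_embedding_text(col: ColumnDoc) -> str:
    """Join the enriched fields into one string to hand to an embedding model."""
    return "\n".join([
        f"column: {col.view_name}.{col.column_name}",
        f"description: {col.description}",
        f"data source: {col.data_source}",
        f"fraud relevance: {col.fraud_relevance}",
    ])


# Two made-up columns from the same view get separate texts, and therefore
# separate vectors, which is what lets retrieval tell them apart.
columns = [
    ColumnDoc("user_send_behavior", "send_count_1h",
              "Number of outbound transfers in the last hour.",
              "payments_events",
              "Spikes can indicate an account being drained."),
    ColumnDoc("user_send_behavior", "lifetime_deposit_volume",
              "Total amount ever deposited by the user.",
              "payments_events",
              "Weak signal on its own; useful as a baseline."),
]
texts = [to_embedding_text(c) for c in columns]
```

Embedding one string per view instead would blend the strong and weak signals into a single vector, which is the situation column-level embedding is meant to avoid.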

Indexing Events

We apply a similar approach to events. Unlike feature views, events don't have a structured metadata catalog — you learn about them by querying your data warehouse or reading team documentation. To make this more discoverable:

We maintain a human-curated registry of important events.
For each event, we record its name, category, stability rating, and a verified parameter schema.
We enrich entries with descriptions of what the event represents and how it might be used for fraud detection.

These enriched event entries are embedded and indexed alongside features. This gives the system visibility into both existing features and raw behavioral signals that could power new features.
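A registry entry and its schema check could be sketched roughly as follows. The field names, the stability labels, and the sample event are assumptions made for illustration, and the check compares against a single sampled payload rather than a real stream.

```python
from dataclasses import dataclass


@dataclass
class EventEntry:
    """One curated registry entry (field names are hypothetical)."""
    name: str
    category: str
    stability: str               # assumed labels, e.g. "stable" or "volatile"
    parameters: dict[str, str]   # parameter name -> expected Python type name
    description: str = ""


def schema_mismatches(entry: EventEntry, live_sample: dict) -> list[str]:
    """Compare the registered schema with one payload sampled from the stream."""
    problems = []
    for param, type_name in entry.parameters.items():
        if param not in live_sample:
            problems.append(f"missing parameter: {param}")
        elif type(live_sample[param]).__name__ != type_name:
            actual = type(live_sample[param]).__name__
            problems.append(
                f"type mismatch for {param}: expected {type_name}, got {actual}")
    for param in live_sample:
        if param not in entry.parameters:
            problems.append(f"unregistered parameter: {param}")
    return problems


entry = EventEntry(
    name="withdrawal_address_added",
    category="account",
    stability="stable",
    parameters={"user_id": "str", "address_age_days": "int"},
    description="User added a new withdrawal destination.",
)
ok = schema_mismatches(entry, {"user_id": "u1", "address_age_days": 0})
drift = schema_mismatches(
    entry, {"user_id": "u1", "address_age_days": "0", "device_id": "d9"})
```

Here `ok` is empty, while `drift` reports one type mismatch and one unregistered parameter, the kind of discrepancy a verification step would want to surface before an entry is indexed.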

The Query Pipeline

When a fraud analyst pastes in an incident description, the pipeline does three main things.

1. Find Candidates with Hybrid Search

First, we run both keyword and semantic search over the feature and event indexes:

Keyword search focuses on exact tokens — feature names, column names, event names, and known domain terms. If an analyst mentions "ACH", "device fingerprint", or "IDV", keyword search ensures those signals surface.
Semantic search looks for meaning, not exact words. A description like "user receives fake customer service call and is convinced to move funds" can still match features about phone verification, wallet address history, and send velocity, even if those words don't appear verbatim.
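The difference between the two retrieval modes can be illustrated with a toy example. This is only a sketch: the keyword ranker is a plain token-overlap count, standing in for whatever keyword scoring the real system uses, and the three-dimensional vectors are invented stand-ins for real embeddings.

```python
import math
import re


def tokenize(text: str) -> set[str]:
    """Lowercase and split into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_rank(query: str, docs: dict[str, str]) -> list[str]:
    """Rank document ids by how many query tokens they contain exactly."""
    q = tokenize(query)
    scores = {doc_id: len(q & tokenize(text)) for doc_id, text in docs.items()}
    hits = [d for d in scores if scores[d] > 0]
    return sorted(hits, key=lambda d: (-scores[d], d))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_rank(query_vec: list[float],
                  doc_vecs: dict[str, list[float]]) -> list[str]:
    """Rank document ids by cosine similarity to the query vector."""
    return sorted(doc_vecs, key=lambda d: (-cosine(query_vec, doc_vecs[d]), d))


# Hypothetical feature descriptions and made-up embedding vectors.
docs = {
    "ach_return_count_30d": "count of ACH returns in the last 30 days",
    "send_velocity_1h": "number of outbound transfers in the past hour",
    "phone_verified_recently": "whether the phone number was verified in the last 7 days",
}
doc_vecs = {
    "ach_return_count_30d": [0.9, 0.1, 0.0],
    "send_velocity_1h": [0.1, 0.8, 0.3],
    "phone_verified_recently": [0.0, 0.3, 0.9],
}

keyword_hits = keyword_rank("ACH returns after fake support call", docs)
semantic_hits = semantic_rank([0.1, 0.9, 0.4], doc_vecs)
```

With these inputs, keyword search surfaces only the feature that literally mentions "ACH", while semantic search puts the send-velocity and phone-verification features first even though the query shares no words with their descriptions.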

We then fuse the results using a standard hybrid retrieval technique called Reciprocal Rank Fusion, which combines multiple ranked lists without hand-tuned weights. In practice, this gives us a robust, high-recall candidate set for the next step.
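Reciprocal Rank Fusion scores each candidate by summing 1 / (k + rank) over every list it appears in. A minimal sketch follows; k = 60 is a commonly used default, and the value used in the system described here is not stated, so treat it as an assumption. The feature names are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]],
                           k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)), rank from 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


keyword_hits = ["ach_return_count_30d", "send_velocity_1h"]
semantic_hits = ["send_velocity_1h", "phone_verified_recently",
                 "ach_return_count_30d"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Because only ranks are used, the raw keyword and similarity scores never need to be put on a common scale, which is why no per-retriever weights are required. A candidate that places well in both lists (here `send_velocity_1h`) ends up ahead of one that tops a single list.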

2. Use the LLM as a Re-Ranker, Not a Generator

The fused candidate list is passed to an LLM along with the original incident description. The model's job is to score and re-rank real candidates — not to invent new feature names.

For each candidate feature view, it:

Assigns a relevance score on a 1–10 scale.
Classifies the signal type (velocity, identity, device, payment, etc.).
Provides a short explanation of why it might help detect the pattern.

We also put some practical constraints around this step:

We cap the number of candidates and the length of explanations so responses stay parseable and within token limits.
We add simple post-processing (for example, basic JSON validation) to guard against malformed responses.

The key design principle is that the LLM never operates in free-form "generate anything" mode. It only re-orders verified candidates, so it can't hallucinate features that don't exist in our systems. We periodically spot-check LLM rankings against known fraud patterns to verify that the re-ranker's scoring remains well-calibrated over time.
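One way to enforce the "re-rank only" guarantee in post-processing is to drop anything in the model's response that is not in the verified candidate set. The sketch below assumes the LLM returns a JSON array of objects with `view`, `score`, `signal_type`, and `reason` keys; that response shape and the 200-character cap are assumptions, not the actual format.

```python
import json


def parse_rerank_response(raw: str, candidate_ids: set[str],
                          max_reason_chars: int = 200) -> list[dict]:
    """Validate an LLM re-rank response and keep only verified candidates."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed JSON: caller can retry or fall back to fused order
    if not isinstance(items, list):
        return []
    kept = []
    for item in items:
        if not isinstance(item, dict):
            continue
        view, score = item.get("view"), item.get("score")
        if not isinstance(view, str) or view not in candidate_ids:
            continue  # name not among the retrieved candidates: discard
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            continue
        if not 1 <= score <= 10:
            continue  # outside the 1-10 relevance scale
        kept.append({
            "view": view,
            "score": score,
            "signal_type": str(item.get("signal_type", "unknown")),
            "reason": str(item.get("reason", ""))[:max_reason_chars],
        })
    return sorted(kept, key=lambda r: -r["score"])


candidates = {"send_velocity_1h", "phone_verified_recently"}
raw = json.dumps([
    {"view": "phone_verified_recently", "score": 6,
     "signal_type": "identity", "reason": "Recent phone change before a drain."},
    {"view": "send_velocity_1h", "score": 9,
     "signal_type": "velocity", "reason": "Burst of sends after the call."},
    {"view": "made_up_feature", "score": 10,
     "signal_type": "device", "reason": "Does not exist in the index."},
])
ranked = parse_rerank_response(raw, candidates)
```

The invented `made_up_feature` entry is discarded even though the model gave it the highest score, so only names that came out of retrieval can ever reach a feature card.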

3. Generate Feature Cards and Proposals

The top results are turned into feature cards that give analysts the context they need to act:

Whether the feature is already present in their chosen training dataset.
Where it lives (for example, which offline table and join keys).
Whether a newer version of the view exists.

In parallel, we use the LLM to generate forward-looking proposals:

Derived features — new computed signals that can be built from existing columns (for example, velocity ratios, time deltas, boolean indicators).
Event-based features — new feature definitions built directly from raw behavioral events, including a starting point for validation queries and feature code.

Highly volatile events (those whose payload values change frequently with rule configurations) are excluded from proposals so we don't suggest features that would immediately go stale.
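The kinds of derived features mentioned above, along with the volatility filter, might look like this in Python. Every function name, column name, threshold, and event here is hypothetical; the real proposals are generated by the LLM and expressed as feature-platform code rather than plain functions.

```python
from datetime import datetime


def velocity_ratio(count_1h: float, count_24h: float) -> float:
    """Share of the last day's sends that happened in the last hour."""
    return count_1h / count_24h if count_24h > 0 else 0.0


def seconds_between(earlier: datetime, later: datetime) -> float:
    """Time delta between two events, in seconds."""
    return (later - earlier).total_seconds()


def is_new_destination(address_age_days: int, threshold_days: int = 1) -> bool:
    """Boolean indicator: destination was added less than threshold_days ago."""
    return address_age_days < threshold_days


def drop_volatile(events: list[dict]) -> list[dict]:
    """Exclude events marked volatile from event-based feature proposals."""
    return [e for e in events if e["stability"] != "volatile"]


ratio = velocity_ratio(count_1h=8, count_24h=10)
gap = seconds_between(datetime(2026, 1, 1, 12, 0), datetime(2026, 1, 1, 12, 5))
fresh = is_new_destination(address_age_days=0)
eligible = drop_volatile([
    {"name": "withdrawal_address_added", "stability": "stable"},
    {"name": "rule_engine_decision", "stability": "volatile"},
])
```

A ratio close to 1 means nearly all of a user's recent sends were packed into the last hour, which is the sort of computed signal that none of the underlying columns expresses on its own.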

Two Ways to Use It

The discovery pipeline is exposed in two forms, depending on who is calling it.

Interactive UI for Analysts

For fraud analysts and investigators, there's a web interface available without any local setup. You paste in a fraud pattern description, optionally add domain keywords, select the training dataset that matches your use case, and click Discover Features.

The results are organized into five tabs:

Feature Cards — Ranked, filterable table of existing feature views — in-training/external flag, relevance score, signal type, join key, offline table.
Derived Feature Ideas — Computed features derivable from existing columns, with feature-platform pseudocode.
Event Feature Ideas — Proposed new feature definitions from raw event streams, with SQL + feature definition sketches.
ELT SQL — A skeleton query joining the training table to all recommended external feature tables.
Dataset Spec / Raw JSON — A structured spec ready for downstream tools: training table, label column, entity keys, feature views.

A typical run returns 30–60 ranked feature cards in under 30 seconds, based on our internal benchmarks. The equivalent manual search (browsing a feature catalog, reading repository files, cross-referencing service membership) is a common pain point across the industry.

Internal Automation API

The same pipeline is callable as a Python function, designed to be invoked programmatically by internal automation or agent workflows. The intended integration point is the Feature Enrichment step of our risk model retrain workflow (described in our companion post on automating risk model retraining with a generative AI tool), where the retrain workflow will call it to get a ranked feature set without going through the UI:

The return value is a structured dictionary with the same outputs shown in the UI, serializable to JSON. The retrain workflow can use this to query the pre-built index rather than crawling the feature store from scratch on each run, eliminating duplicated work between the two systems.
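As a rough illustration of the shape such a call could take, the stub below uses a hypothetical function name, hypothetical parameters, and hypothetical result keys that mirror the UI tabs. It returns an empty result and is not the actual interface.

```python
import json
from typing import Optional


def discover_features(description: str,
                      keywords: Optional[list[str]] = None,
                      training_dataset: Optional[str] = None) -> dict:
    """Hypothetical entry point; this stub only returns the assumed shape."""
    return {
        "query": {
            "description": description,
            "keywords": keywords or [],
            "training_dataset": training_dataset,
        },
        "feature_cards": [],          # ranked existing feature views
        "derived_feature_ideas": [],  # computed features from existing columns
        "event_feature_ideas": [],    # proposals built from raw events
        "elt_sql": "",                # skeleton join query
        "dataset_spec": {},           # machine-readable spec for downstream use
    }


result = discover_features(
    "user is called by a fake support agent and convinced to drain their account",
    keywords=["ACH", "device fingerprint"],
    training_dataset="example_training_dataset",
)
payload = json.dumps(result)  # plain dicts and lists serialize directly
```

Keeping the return value to plain dictionaries, lists, and strings is what makes it trivially serializable for an automated caller.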

The Output as a Machine-Readable Spec

The most important downstream use of the tool is as a structured input to other systems. The dataset_spec output is a JSON object that fully describes how to construct a training dataset:
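As an illustration only, a spec carrying the fields named earlier (training table, label column, entity keys, feature views) might look like the following. All key names, table names, and values are hypothetical, and the small completeness check is an added convenience rather than part of the described tool.

```python
import json

# Hypothetical spec; key names and values are assumptions for illustration.
dataset_spec = {
    "training_table": "risk.training_events",
    "label_column": "is_fraud",
    "entity_keys": ["user_id"],
    "feature_views": [
        {
            "name": "user_send_behavior",
            "offline_table": "features.user_send_behavior",
            "join_keys": ["user_id"],
            "columns": ["send_count_1h", "send_count_24h"],
            "in_training": False,  # external view the analyst chose to join in
        },
    ],
}

REQUIRED_FIELDS = ("training_table", "label_column", "entity_keys",
                   "feature_views")


def missing_fields(spec: dict) -> list[str]:
    """Return the required top-level fields that the spec lacks."""
    return [name for name in REQUIRED_FIELDS if name not in spec]


spec_json = json.dumps(dataset_spec, indent=2)
```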

This spec feeds directly into two downstream systems:

Model training / retraining — A training pipeline can consume the spec to automatically build the feature join query, pull training data, and kick off a model training run. The feature discovery step becomes the first stage in a reproducible, documented pipeline rather than an implicit step that lives only in an analyst's head.
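To show how a pipeline might turn a spec into a feature join query, here is a minimal sketch. The spec keys and table names are hypothetical, identifiers are interpolated without escaping (so it is only suitable for trusted, reviewed specs), and it ignores point-in-time correctness, which a real training join would need to avoid label leakage.

```python
def build_join_query(spec: dict) -> str:
    """Build a skeleton SQL query joining the training table to feature tables."""
    select = ["t.*"]
    joins = []
    for i, view in enumerate(spec["feature_views"]):
        alias = f"f{i}"
        select += [f"{alias}.{col}" for col in view["columns"]]
        on = " AND ".join(f"t.{key} = {alias}.{key}" for key in view["join_keys"])
        joins.append(f"LEFT JOIN {view['offline_table']} AS {alias} ON {on}")
    lines = ["SELECT " + ", ".join(select),
             f"FROM {spec['training_table']} AS t"]
    return "\n".join(lines + joins)


# Hypothetical spec with a single external feature view.
spec = {
    "training_table": "risk.training_events",
    "feature_views": [
        {
            "offline_table": "features.user_send_behavior",
            "join_keys": ["user_id"],
            "columns": ["send_count_1h"],
        },
    ],
}
query = build_join_query(spec)
```

A LEFT JOIN is used so that training rows without a matching feature row are kept with nulls rather than silently dropped.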

Automated rule generation — Our rule auto-generation system accepts this same spec format. Once reviewed by a human, the spec can be handed off to train a decision tree, generate candidate rules, and surface them for analyst review. Feature discovery bridges the gap between "we observed a new fraud pattern" and "we have a trained model and a set of rules ready to deploy."

The human stays in the loop at the feature selection step, reviewing the ranked cards, choosing which external features to join in, and approving the spec before it flows downstream. In our experience, the tool accelerates that review from hours to minutes.

What's Next

The longer-term vision is a closed loop: a fraud analyst describes a new pattern, the tool returns a reviewed and approved dataset spec, our rule generation system trains a model and generates candidate rules, and those rules are deployed, all within a single documented, reproducible workflow. Feature discovery is the first step in making that loop fast enough to be useful when it matters most: the first hours after a new attack pattern is identified.