Where large language models are actually deployed in the payments stack — and where they are not

The payments industry has not been immune to large language model hype. But underneath the press releases, a clearer picture is emerging: LLMs are genuinely useful for specific tasks in the payments stack, and largely irrelevant for others.

Where LLMs are genuinely deployed

Dispute and chargeback handling. Stripe has spoken publicly about using GPT-4-class models to analyse dispute evidence — merchant receipts, delivery confirmations, customer communications — and generate draft responses. The model reads unstructured documents and produces structured summaries that human agents review. This cuts average dispute-handling time significantly without removing human oversight from the decision.
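The shape of that workflow — unstructured evidence in, structured draft out, human review flagged — can be sketched in a few lines. This is a minimal illustration, not Stripe's implementation: the `complete` callable, the JSON keys, and the `fake_complete` stub are all hypothetical stand-ins for whatever model API and schema a real system would use.

```python
import json

def summarise_dispute(evidence: dict, complete) -> dict:
    """Build a structured-summary prompt from unstructured dispute
    evidence and parse the model's JSON reply.

    `complete` is a hypothetical LLM completion callable (prompt -> str);
    a real deployment would wrap a vendor API here."""
    prompt = (
        "Summarise the following dispute evidence as JSON with keys "
        "'order_delivered', 'customer_contacted', 'recommendation'.\n\n"
        + "\n\n".join(f"--- {name} ---\n{text}" for name, text in evidence.items())
    )
    draft = json.loads(complete(prompt))
    # The draft is only a recommendation: a human agent reviews it
    # before any response goes to the card network.
    draft["requires_human_review"] = True
    return draft

# Stubbed model so the sketch runs end to end without any API.
def fake_complete(prompt: str) -> str:
    return json.dumps({
        "order_delivered": True,
        "customer_contacted": False,
        "recommendation": "contest",
    })

summary = summarise_dispute(
    {"delivery_confirmation": "Parcel signed for on 2024-03-02."},
    complete=fake_complete,
)
```

The key design point is the last step: the model never decides the dispute, it only assembles a reviewable draft.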

AML narrative generation. Banks are required to file Suspicious Activity Reports (SARs) when they identify potentially illicit transactions. Writing these narratives is time-consuming and formulaic — exactly the kind of task LLMs handle well. Several tier-one banks are now using LLM-assisted SAR drafting, with compliance officers reviewing and approving before submission.
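The same draft-then-approve pattern applies to SAR narratives. The sketch below is illustrative only — the `complete` callable, field names, and the who/what/when/where/why prompt are assumptions, not any bank's actual pipeline — but it shows the control that matters: a draft starts in a pending state and nothing is filed until a compliance officer approves it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SarDraft:
    narrative: str
    status: str = "pending_review"
    approved_by: Optional[str] = None

def draft_sar(txn_facts: dict, complete) -> SarDraft:
    """`complete` is a hypothetical LLM callable (prompt -> str); a real
    deployment would wrap a vendor API behind the bank's own controls."""
    prompt = (
        "Draft a SAR narrative covering who, what, when, where and why, "
        f"based on these case facts: {txn_facts}"
    )
    return SarDraft(narrative=complete(prompt))

def approve(draft: SarDraft, officer: str) -> SarDraft:
    # Nothing is submitted until a compliance officer signs off.
    draft.status = "approved"
    draft.approved_by = officer
    return draft

# Stub model so the sketch runs without any API.
draft = draft_sar(
    {"account": "****1234", "pattern": "structuring"},
    complete=lambda p: "Between 1 and 5 March, the customer made a series of ...",
)
```

The LLM handles the formulaic writing; the state machine makes human approval a hard gate rather than a convention.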

Where LLMs are not used

Real-time transaction scoring is not an LLM use case. Scoring a transaction in under 50 milliseconds requires specialised gradient boosting models or neural networks trained specifically on transaction data — not general-purpose language models. LLMs are too slow, too computationally expensive, and not trained on the right data for this task.
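To see why, it helps to look at what real-time scoring actually computes. The toy scorer below imitates the shape of a boosted-trees model with three hand-picked decision stumps over engineered transaction features — the feature names, thresholds, and weights are all invented for illustration, not a trained model — and scoring is a handful of comparisons, which is why such models fit comfortably inside a 50-millisecond budget where an LLM forward pass cannot.

```python
import math
import time

# Toy "boosted stumps" scorer: each stump checks one engineered feature
# against a threshold and contributes a weight if it fires. Thresholds
# and weights here are illustrative, not learned from data.
STUMPS = [
    ("amount", 900.0, 1.4),            # unusually large amount
    ("txn_count_last_hour", 5, 0.9),   # rapid-fire card use
    ("country_mismatch", 0.5, 1.1),    # IP country differs from card country
]

def score(txn: dict) -> float:
    """Sum the weights of fired stumps, then squash to a (0, 1) risk score."""
    raw = sum(w for feat, thresh, w in STUMPS if txn[feat] > thresh)
    return 1.0 / (1.0 + math.exp(2.0 - raw))

txn = {"amount": 1200.0, "txn_count_last_hour": 8, "country_mismatch": 1.0}
t0 = time.perf_counter()
risk = score(txn)
latency_ms = (time.perf_counter() - t0) * 1000.0
```

Production systems use real gradient-boosted ensembles with hundreds of trees, but the arithmetic stays this cheap per tree: structured features in, a number out, in microseconds.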

The distinction matters because conflating LLMs with AI in payments leads to unrealistic expectations. The payments stack uses many types of machine learning — most of which have nothing to do with language models.