Efficient Integration of ASR with Large Language Models to Enhance Video Search at Scale

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Video is currently the most consumed content on the internet, and a growing share of information and knowledge is shared as video. This has led to a significant rise in video search requests. On Bilibili, one of China's largest knowledge-based video platforms, over half of daily users search for videos. Recent advances in multimodal embeddings and text-to-video retrieval have shown promise, yet large-scale systems must return search results within seconds. The high computational cost of image-based models has limited their scalability in video retrieval, leaving text-based search dominant and much video content under-indexed. This paper presents a design that integrates ASR (Automatic Speech Recognition) text and uses large language models (LLMs) to optimize video search. ASR provides cost-effective video understanding and has been widely used for generating subtitles. However, in search scenarios, it suffers from noise and entity errors, which hurt accuracy and lead to false retrievals. We leverage LLMs to generate high-quality text and summaries from the original ASR text and integrate them into the video search engine to improve retrieval. LLMs are also used to enhance query understanding and relevance. Additionally, we enable direct answer generation from video summaries when watching is inconvenient for users. Offline evaluations and user experiments show significant improvements in search satisfaction while maintaining manageable computational costs. Deployed on Bilibili for a year, the enhanced video search engine has received daily feedback from millions of users, providing a best practice for using LLMs in video search and lessons for further optimization.

Original language: English
Title of host publication: WWW Companion 2025 - Companion Proceedings of the ACM Web Conference 2025
Publisher: Association for Computing Machinery, Inc
Pages: 601-610
Number of pages: 10
ISBN (Electronic): 9798400713316
State: Published - 23 May 2025
Event: 34th ACM Web Conference, WWW Companion 2025 - Sydney, Australia
Duration: 28 Apr 2025 to 2 May 2025

Publication series

Name: WWW Companion 2025 - Companion Proceedings of the ACM Web Conference 2025

Conference

Conference: 34th ACM Web Conference, WWW Companion 2025
Country/Territory: Australia
City: Sydney
Period: 28/04/25 to 2/05/25

Keywords

  • Automatic Speech Recognition
  • Large Language Model
  • Video Search Engine
