TY - GEN
T1 - Efficient Integration of ASR with Large Language Models to Enhance Video Search at Scale
AU - Zhang, Qiang
AU - Xiao, Fengshun
AU - Li, Tianjiao
AU - Lin, Li
AU - Fang, Hanyin
AU - Sun, Huyang
AU - Liu, Ruoyu
AU - Zhu, Xiaoyan
AU - Wang, Jiayin
N1 - Publisher Copyright:
© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2025/5/23
Y1 - 2025/5/23
N2 - Currently, video is the most consumed content on the internet, with an increasing amount of information and knowledge shared as videos. This has led to a significant rise in video search requests. On Bilibili, one of China's largest knowledge-based video platforms, over half of daily users search for videos. Recent advancements in multimodal embeddings and text-to-video retrieval have shown promise, yet large-scale systems require search results within seconds. The high computational costs of image-based models have limited their scalability in video retrieval, with text-based searches remaining dominant and much video content under-indexed. This paper presents a design that integrates ASR (Automatic Speech Recognition) text and uses large language models (LLMs) to optimize video search. ASR provides cost-effective video understanding and has been widely used for generating subtitles. However, in search scenarios, it suffers from noise and entity errors, impacting accuracy and leading to false retrieval. We leverage LLMs to generate high-quality text and summaries from original ASR text, integrating them into the video search engine to improve retrieval. LLMs are also used to enhance query understanding and relevance. Additionally, we enable direct answer generation from video summaries when watching is inconvenient for users. Offline evaluations and user experiments show significant improvements in search satisfaction while maintaining manageable computational costs. Deployed on Bilibili for a year, the enhanced video search engine has received daily feedback from millions of users, providing a best practice for using LLMs in video search and lessons for further optimization.
AB - Currently, video is the most consumed content on the internet, with an increasing amount of information and knowledge shared as videos. This has led to a significant rise in video search requests. On Bilibili, one of China's largest knowledge-based video platforms, over half of daily users search for videos. Recent advancements in multimodal embeddings and text-to-video retrieval have shown promise, yet large-scale systems require search results within seconds. The high computational costs of image-based models have limited their scalability in video retrieval, with text-based searches remaining dominant and much video content under-indexed. This paper presents a design that integrates ASR (Automatic Speech Recognition) text and uses large language models (LLMs) to optimize video search. ASR provides cost-effective video understanding and has been widely used for generating subtitles. However, in search scenarios, it suffers from noise and entity errors, impacting accuracy and leading to false retrieval. We leverage LLMs to generate high-quality text and summaries from original ASR text, integrating them into the video search engine to improve retrieval. LLMs are also used to enhance query understanding and relevance. Additionally, we enable direct answer generation from video summaries when watching is inconvenient for users. Offline evaluations and user experiments show significant improvements in search satisfaction while maintaining manageable computational costs. Deployed on Bilibili for a year, the enhanced video search engine has received daily feedback from millions of users, providing a best practice for using LLMs in video search and lessons for further optimization.
KW - Automatic Speech Recognition
KW - Large Language Model
KW - Video Search Engine
UR - https://www.scopus.com/pages/publications/105009233601
U2 - 10.1145/3701716.3715220
DO - 10.1145/3701716.3715220
M3 - Conference contribution
AN - SCOPUS:105009233601
T3 - WWW Companion 2025 - Companion Proceedings of the ACM Web Conference 2025
SP - 601
EP - 610
BT - WWW Companion 2025 - Companion Proceedings of the ACM Web Conference 2025
PB - Association for Computing Machinery, Inc
T2 - 34th ACM Web Conference, WWW Companion 2025
Y2 - 28 April 2025 through 2 May 2025
ER -