TY - JOUR
T1 - Deep Web adaptive crawling based on minimum executable pattern
AU - Liu, Jun
AU - Jiang, Lu
AU - Wu, Zhaohui
AU - Zheng, Qinghua
PY - 2011/4
Y1 - 2011/4
N2 - The key to Deep Web crawling is to submit valid input values to a query form and retrieve Deep Web content efficiently. In the literature, related work focuses only on generic text boxes or entire query forms, causing the problem of "data islands" or inferior validity of query submission. This paper proposes the concept of Minimum Executable Pattern (MEP), a minimal combination of elements in a query form that can conduct a successful query, and then presents a MEP generation method and a MEP-based Deep Web adaptive crawling method. The query form is parsed and partitioned into a MEP set, and then locally optimal queries are generated by choosing a MEP in the MEP set and a keyword vector of the MEP. Furthermore, the crawler can decide when to terminate, balancing the trade-off between high coverage of the content and resource consumption. The adoption of MEP is expected to improve the validity of query submission, and adaptive selection of multiple MEPs is effective in overcoming the problem of "data islands". We present a set of experiments to validate the effectiveness of the proposed method. Experimental results show that our method outperforms the state-of-the-art methods in terms of query capability and applicability, and on average, it achieves good coverage by issuing only a few hundred queries.
AB - The key to Deep Web crawling is to submit valid input values to a query form and retrieve Deep Web content efficiently. In the literature, related work focuses only on generic text boxes or entire query forms, causing the problem of "data islands" or inferior validity of query submission. This paper proposes the concept of Minimum Executable Pattern (MEP), a minimal combination of elements in a query form that can conduct a successful query, and then presents a MEP generation method and a MEP-based Deep Web adaptive crawling method. The query form is parsed and partitioned into a MEP set, and then locally optimal queries are generated by choosing a MEP in the MEP set and a keyword vector of the MEP. Furthermore, the crawler can decide when to terminate, balancing the trade-off between high coverage of the content and resource consumption. The adoption of MEP is expected to improve the validity of query submission, and adaptive selection of multiple MEPs is effective in overcoming the problem of "data islands". We present a set of experiments to validate the effectiveness of the proposed method. Experimental results show that our method outperforms the state-of-the-art methods in terms of query capability and applicability, and on average, it achieves good coverage by issuing only a few hundred queries.
KW - Adaptive crawling
KW - Deep Web
KW - Deep Web surfacing
KW - Minimum executable pattern
UR - https://www.scopus.com/pages/publications/79952184634
U2 - 10.1007/s10844-010-0124-5
DO - 10.1007/s10844-010-0124-5
M3 - Article
AN - SCOPUS:79952184634
SN - 0925-9902
VL - 36
SP - 197
EP - 215
JO - Journal of Intelligent Information Systems
JF - Journal of Intelligent Information Systems
IS - 2
ER -