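Baidu also announced its "Aladdin Platform" (阿拉丁平台) plan today. The plan aims to address the large amount of hidden web (暗网) content on the internet that existing search engines cannot crawl or retrieve. Reportedly, Baidu has already assigned more than 1,000 people to developing the platform.
So I searched online for material on the "hidden web" and found some English-language sources.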
Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of Web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. In particular, they ignore the tremendous amount of high quality content “hidden” behind search forms, in large searchable electronic databases. In this paper, we address the problem of designing a crawler capable of extracting content from this hidden Web. We introduce a generic operational model of a hidden Web crawler and describe how this model is realized in HiWE (Hidden Web Exposer), a prototype crawler built at Stanford. We introduce a new Layout-based Information Extraction Technique (LITE) and demonstrate its use in automatically extracting semantic information from search forms and response pages. We also present results from experiments conducted to test and validate our techniques.
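
To make the idea of "content behind search forms" a bit more concrete, here is a minimal Python sketch of the first step such a crawler might take: find the forms on a page, collect their named fields, and build the URL for one submission. This is not HiWE and not the LITE technique from the abstract (which relies on page layout); it only reads raw HTML tags. The names `FormExtractor`, `build_query_url`, and `SAMPLE_PAGE`, the `example.com` URL, and the field values are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormExtractor(HTMLParser):
    """Collect each <form> with its action, method, and named fields.

    Hypothetical helper for illustration; not part of HiWE.
    """

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
        elif self._current is not None and tag in ("input", "select", "textarea"):
            # Only named controls are sent when a form is submitted.
            name = attrs.get("name")
            if name:
                self._current["fields"].append(name)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def build_query_url(page_url, form, values):
    """Build the GET URL for one submission of the form (assumes method=get)."""
    params = {name: values.get(name, "") for name in form["fields"]}
    return urljoin(page_url, form["action"]) + "?" + urlencode(params)


# A made-up search page; a real crawler would download this HTML instead.
SAMPLE_PAGE = """
<form action="/search" method="get">
  <input type="text" name="title">
  <select name="category"><option>books</option></select>
  <input type="submit" value="Go">
</form>
"""

extractor = FormExtractor()
extractor.feed(SAMPLE_PAGE)
form = extractor.forms[0]
url = build_query_url(
    "http://example.com/catalog",
    form,
    {"title": "crawler", "category": "books"},
)
print(form["fields"])
print(url)
```

The hard part that the abstract points at, deciding which values make sense for each field, is skipped here entirely: the values are simply passed in by hand.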
Reposted from: https://www.cnblogs.com/istep/archive/2008/12/18/1357589.html