About: Focused crawler

An Entity of Type: yago:Whole100003553, within Data Space: dbpedia.org, associated with source document(s)
http://dbpedia.org/describe/?url=http%3A%2F%2Fdbpedia.org%2Fresource%2FFocused_crawler

A focused crawler is a web crawler that collects Web pages that satisfy some specific property, by carefully prioritizing the crawl frontier and managing the hyperlink exploration process. Some predicates may be based on simple, deterministic and surface properties. For example, a crawler's mission may be to crawl pages from only the .jp domain. Other predicates may be softer or comparative, e.g., "crawl pages about baseball", or "crawl pages with large PageRank". An important page property pertains to topics, leading to 'topical crawlers'. For example, a topical crawler may be deployed to collect pages about solar power, swine flu, or even more abstract concepts like controversy while minimizing resources spent fetching pages on other topics. Crawl frontier management may not be the only device used by focused crawlers; they may use a Web directory, a Web text index, backlinks, or any other Web artifact.
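This collection behavior amounts to a best-first search over the Web graph: keep a priority queue of candidate URLs (the crawl frontier), repeatedly pop the most promising one, fetch it if it satisfies the target property, and enqueue its outlinks with priority scores. The following is a minimal Python sketch of that loop; the is_relevant predicate (the ".jp domain" example above), the anchor-text score function, and the use of the requests and BeautifulSoup libraries are illustrative assumptions, not any particular published crawler.

    import heapq
    from urllib.parse import urljoin, urlparse

    import requests                   # assumed HTTP client
    from bs4 import BeautifulSoup     # assumed HTML parser for link extraction

    def is_relevant(url: str) -> bool:
        # Hard, deterministic predicate from the text: crawl only the .jp domain.
        host = urlparse(url).hostname or ""
        return host.endswith(".jp")

    def score(anchor_text: str) -> float:
        # Soft, comparative predicate: prefer links whose anchor text mentions
        # the topic (the "pages about baseball" example above).
        return 1.0 if "baseball" in anchor_text.lower() else 0.1

    def focused_crawl(seeds, max_pages=100):
        # The crawl frontier is a max-priority queue; heapq is a min-heap,
        # so priorities are stored negated.
        frontier = [(-1.0, url) for url in seeds]
        heapq.heapify(frontier)
        seen = set(seeds)
        collected = []
        while frontier and len(collected) < max_pages:
            _, url = heapq.heappop(frontier)
            if not is_relevant(url):
                continue                          # page fails the predicate
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            collected.append(url)
            # Frontier management: enqueue unseen outlinks, prioritized by
            # how promising their anchor text looks.
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"])
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score(a.get_text()), link))
        return collected

The priority queue is what distinguishes this from a breadth-first crawler: effort is spent on the most promising links first, which is the sense in which the frontier is "carefully prioritized".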

Attributes / Values
rdf:type
rdfs:label
  • Focused crawler (ar)
  • Focused crawler (en)
rdfs:comment
  • A focused crawler, or topical crawler, is a web crawler that attempts to download only web pages that are relevant to a predefined topic or set of topics. Topical crawling generally assumes that only the topic is given, while focused crawling also assumes that some labeled examples of relevant and non-relevant pages are available. Topical crawling was first introduced by Menczer. (ar)
  • A focused crawler is a web crawler that collects Web pages that satisfy some specific property, by carefully prioritizing the crawl frontier and managing the hyperlink exploration process. Some predicates may be based on simple, deterministic and surface properties. For example, a crawler's mission may be to crawl pages from only the .jp domain. Other predicates may be softer or comparative, e.g., "crawl pages about baseball", or "crawl pages with large PageRank". An important page property pertains to topics, leading to 'topical crawlers'. For example, a topical crawler may be deployed to collect pages about solar power, swine flu, or even more abstract concepts like controversy while minimizing resources spent fetching pages on other topics. Crawl frontier management may not be the only device used by focused crawlers; they may use a Web directory, a Web text index, backlinks, or any other Web artifact. (en)
dcterms:subject
Wikipage page ID
Wikipage revision ID
Link from a Wikipage to another Wikipage
sameAs
dbp:wikiPageUsesTemplate
has abstract
  • A focused crawler, or topical crawler, is a web crawler that attempts to download only web pages that are relevant to a predefined topic or set of topics. Topical crawling generally assumes that only the topic is given, while focused crawling also assumes that some labeled examples of relevant and non-relevant pages are available. Topical crawling was first introduced by Menczer. (ar)
  • A focused crawler is a web crawler that collects Web pages that satisfy some specific property, by carefully prioritizing the crawl frontier and managing the hyperlink exploration process. Some predicates may be based on simple, deterministic and surface properties. For example, a crawler's mission may be to crawl pages from only the .jp domain. Other predicates may be softer or comparative, e.g., "crawl pages about baseball", or "crawl pages with large PageRank". An important page property pertains to topics, leading to 'topical crawlers'. For example, a topical crawler may be deployed to collect pages about solar power, swine flu, or even more abstract concepts like controversy while minimizing resources spent fetching pages on other topics. Crawl frontier management may not be the only device used by focused crawlers; they may use a Web directory, a Web text index, backlinks, or any other Web artifact.

    A focused crawler must predict the probability that an unvisited page will be relevant before actually downloading the page. A possible predictor is the anchor text of links; this was the approach taken by Pinkerton in a crawler developed in the early days of the Web. Topical crawling was first introduced by Filippo Menczer. Chakrabarti et al. coined the term 'focused crawler' and used a text classifier to prioritize the crawl frontier (a sketch combining these two ideas appears after the attributes table below). Andrew McCallum and co-authors also used reinforcement learning to focus crawlers. Diligenti et al. traced the context graph leading up to relevant pages, and their text content, to train classifiers. A form of online reinforcement learning has been used, along with features extracted from the DOM tree and text of linking pages, to continually train classifiers that guide the crawl. In a review of topical crawling algorithms, Menczer et al. show that such simple strategies are very effective for short crawls, while more sophisticated techniques such as reinforcement learning and evolutionary adaptation can give the best performance over longer crawls. It has been shown that spatial information is important to classify Web documents.

    Another type of focused crawler is the semantic focused crawler, which makes use of domain ontologies to represent topical maps and link Web pages with relevant ontological concepts for selection and categorization purposes. In addition, ontologies can be automatically updated during the crawling process. Dong et al. introduced such an ontology-learning-based crawler, using a support vector machine to update the content of ontological concepts while crawling Web pages.

    Crawlers may also be focused on page properties other than topics. Cho et al. study a variety of crawl prioritization policies and their effects on the link popularity of fetched pages. Najork and Wiener show that breadth-first crawling, starting from popular seed pages, leads to collecting large-PageRank pages early in the crawl. Refinements involving the detection of stale (poorly maintained) pages have been reported by Eiron et al. A kind of semantic focused crawler that makes use of reinforcement learning has been introduced by Meusel et al., using online classification algorithms in combination with a bandit-based selection strategy to efficiently crawl pages with markup languages like RDFa, Microformats, and Microdata.

    The performance of a focused crawler depends on the richness of links in the specific topic being searched, and focused crawling usually relies on a general web search engine for providing starting points. Davison presented studies on Web links and text that explain why focused crawling succeeds on broad topics; similar studies were presented by Chakrabarti et al.

    Seed selection can be important for focused crawlers and can significantly influence crawling efficiency. A whitelist strategy is to start the focused crawl from a list of high-quality seed URLs and to limit the crawling scope to the domains of those URLs (a sketch of this strategy also appears after the attributes table below). These high-quality seeds should be selected from a list of URL candidates accumulated over a sufficiently long period of general web crawling. The whitelist should be updated periodically after it is created. (en)
gold:hypernym
prov:wasDerivedFrom
page length (characters) of wiki page
foaf:isPrimaryTopicOf
is Link from a Wikipage to another Wikipage of
is Wikipage redirect of
is foaf:primaryTopic of
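The abstract above mentions two early relevance predictors: Pinkerton's use of anchor text and Chakrabarti et al.'s text classifier over the crawl frontier. The sketch below combines the two ideas, scoring a link by a classifier applied to its anchor text before the page is ever downloaded. It is a toy illustration, not either published system: the training anchors and labels are invented, and the scikit-learn pipeline is an assumed implementation choice.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical labeled examples: anchor texts of links that did (1) or
    # did not (0) lead to on-topic pages ("solar power"), from earlier crawls.
    anchors = ["solar panel efficiency", "photovoltaic cell design",
               "celebrity gossip roundup", "solar power plant opens",
               "football scores tonight", "renewable energy policy"]
    labels = [1, 1, 0, 1, 0, 1]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(anchors, labels)

    def frontier_priority(anchor_text: str) -> float:
        # Estimated probability that the linked, still-unvisited page is
        # relevant; used to order the crawl frontier before downloading.
        return clf.predict_proba([anchor_text])[0][1]

    print(frontier_priority("home solar installation guide"))  # relatively high
    print(frontier_priority("latest movie trailers"))          # relatively low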
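The whitelist seed strategy described at the end of the abstract is simple to state in code: start from high-quality seed URLs and never leave their domains. A minimal sketch, with an invented seed list standing in for candidates accumulated over a long period of general crawling:

    from urllib.parse import urlparse

    # Hypothetical whitelist of high-quality seeds; in practice the list is
    # built from long-running general crawls and refreshed periodically.
    WHITELIST_SEEDS = [
        "https://www.energy.gov/solar",
        "https://en.wikipedia.org/wiki/Solar_power",
    ]
    ALLOWED_DOMAINS = {urlparse(u).hostname for u in WHITELIST_SEEDS}

    def in_scope(url: str) -> bool:
        # Limit the crawling scope to the domains of the seed URLs.
        return urlparse(url).hostname in ALLOWED_DOMAINS

    assert in_scope("https://www.energy.gov/solar/articles")
    assert not in_scope("https://example.com/solar")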