About: StormCrawler

Property	Value
dbo:abstract	StormCrawler is an open-source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is provided under the Apache License and is written mostly in Java. StormCrawler is modular: its core module provides the basic building blocks of a web crawler, such as fetching, parsing, and URL filtering. Beyond the core components, the project also provides external resources, for instance spouts and bolts for Elasticsearch and Apache Solr, and a ParserBolt that uses Apache Tika to parse various document formats. The project is used in production by various companies. Linux.com published a Q&A with the author of StormCrawler in October 2016, and InfoQ ran one in December 2016. A comparative benchmark with Apache Nutch was published on dzone.com in January 2017. Several research papers mention the use of StormCrawler, in particular "Crawling the German Health Web: Exploratory Study and Graph Analysis", "The generation of a multi-million page corpus for the Persian language", and "The SIREN - Security Information Retrieval and Extraction engine". The project wiki contains a list of videos and slides available online. Notably, Common Crawl uses StormCrawler to generate a large, publicly available dataset of news. (en)
dbo:genre	dbr:Web_crawler
dbo:latestReleaseDate	2022-01-11 (xsd:date)
dbo:latestReleaseVersion	2.2
dbo:license	dbr:Apache_License
dbo:programmingLanguage	dbr:Java_(programming_language)
dbo:releaseDate	2014-09-11 (xsd:date)
dbo:wikiPageID	50900042 (xsd:integer)
dbo:wikiPageLength	4661 (xsd:nonNegativeInteger)
dbo:wikiPageRevisionID	1116865817 (xsd:integer)
dbo:wikiPageWikiLink	dbc:Software_using_the_Apache_license dbr:Common_Crawl dbr:Elasticsearch dbr:Apache_License dbr:Apache_Nutch dbr:Apache_Solr dbr:Apache_Storm dbr:Apache_Tika dbr:Linux.com dbr:Storm_(event_processor) dbc:Web_crawlers dbr:Web_crawler dbr:Java_(programming_language) dbc:Free_software_programmed_in_Java_(programming_language) dbr:Open-source_software
dbp:developer	DigitalPebble, Ltd. (en)
dbp:genre	dbr:Web_crawler
dbp:latestReleaseDate	2022-01-11 (xsd:date)
dbp:latestReleaseVersion	2.200000 (xsd:double)
dbp:license	dbr:Apache_License
dbp:name	StormCrawler (en)
dbp:programmingLanguage	dbr:Java_(programming_language)
dbp:released	2014-09-11 (xsd:date)
dbp:wikiPageUsesTemplate	dbt:Advert dbt:Infobox_software dbt:More_citations_needed dbt:Multiple_issues dbt:Notability dbt:Portal dbt:Primary_sources dbt:Reflist dbt:Start_date dbt:Start_date_and_age dbt:Url
dcterms:subject	dbc:Software_using_the_Apache_license dbc:Web_crawlers dbc:Free_software_programmed_in_Java_(programming_language)
rdf:type	owl:Thing dbo:Software schema:CreativeWork dbo:Work wikidata:Q386724 wikidata:Q7397
rdfs:comment	StormCrawler is an open-source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is provided under the Apache License and is written mostly in Java. StormCrawler is modular: its core module provides the basic building blocks of a web crawler, such as fetching, parsing, and URL filtering. Beyond the core components, the project also provides external resources, for instance spouts and bolts for Elasticsearch and Apache Solr, and a ParserBolt that uses Apache Tika to parse various document formats. (en)
rdfs:label	StormCrawler (en)
owl:sameAs	wikidata:StormCrawler https://global.dbpedia.org/id/2LoGx
prov:wasDerivedFrom	wikipedia-en:StormCrawler?oldid=1116865817&ns=0
foaf:isPrimaryTopicOf	wikipedia-en:StormCrawler
foaf:name	StormCrawler (en)
is dbo:wikiPageWikiLink of	dbr:Web_crawler dbr:Web_ARChive
is foaf:primaryTopic of	wikipedia-en:StormCrawler