About: Europarl Corpus

An Entity of Type: Thing, from Named Graph: http://dbpedia.org, within Data Space: dbpedia.org

Property	Value
dbo:abstract	The Europarl Corpus is a corpus (set of documents) that consists of the proceedings of the European Parliament from 1996 to 2012. In its first release in 2001, it covered eleven official languages of the European Union (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish). With the enlargement of the EU, the official languages of the ten new member states were added to the corpus. The latest release (2012) comprises up to 60 million words per language; the newly added languages are slightly underrepresented because data for them is available only from 2007 onwards. This latest version includes 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek. The data that makes up the corpus was extracted from the website of the European Parliament and then prepared for linguistic research. After sentence splitting and tokenization, the sentences were aligned across languages using an algorithm developed by Gale & Church (1993). The corpus has been compiled and expanded by a group of researchers led by Philipp Koehn at the University of Edinburgh. It was initially designed for research in statistical machine translation (SMT). Since its first release, however, it has been used for other research purposes, such as word-sense disambiguation. The corpus can also be searched through the corpus management system Sketch Engine. (en)
dbo:wikiPageExternalLink	http://opus.lingfil.uu.se/Europarl.php http://www.statmt.org/europarl
dbo:wikiPageID	36200511 (xsd:integer)
dbo:wikiPageLength	6490 (xsd:nonNegativeInteger)
dbo:wikiPageRevisionID	1110423993 (xsd:integer)
dbo:wikiPageWikiLink	dbr:Enlargement_of_the_European_Union dbc:Corpora dbr:Gale-Church_alignment_algorithm dbr:Target_language_(translation) dbr:Linguistic dbr:Sketch_Engine dbr:Back_translation dbr:Tokenization_(lexical_analysis) dbc:European_Parliament dbr:European_Parliament dbr:European_Union dbr:Text_corpus dbr:BLEU dbr:Philipp_Koehn dbr:Statistical_machine_translation dbr:Word-sense_disambiguation dbr:Evaluation_of_machine_translation
dbp:wikiPageUsesTemplate	dbt:Reflist dbt:Short_description dbt:Corpus_linguistics
dcterms:subject	dbc:Corpora dbc:European_Parliament
rdfs:comment	The Europarl Corpus is a corpus (set of documents) that consists of the proceedings of the European Parliament from 1996 to 2012. In its first release in 2001, it covered eleven official languages of the European Union (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish). With the enlargement of the EU, the official languages of the ten new member states were added to the corpus. The latest release (2012) comprises up to 60 million words per language; the newly added languages are slightly underrepresented because data for them is available only from 2007 onwards. This latest version includes 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Sl (en)
rdfs:label	Europarl Corpus (en)
owl:sameAs	yago-res:Europarl Corpus wikidata:Europarl Corpus https://global.dbpedia.org/id/4k5US
prov:wasDerivedFrom	wikipedia-en:Europarl_Corpus?oldid=1110423993&ns=0
foaf:isPrimaryTopicOf	wikipedia-en:Europarl_Corpus
is dbo:wikiPageRedirects of	dbr:Europarl_corpus
is dbo:wikiPageWikiLink of	dbr:Google_Translate dbr:Europarl_corpus dbr:List_of_text_corpora
is foaf:primaryTopic of	wikipedia-en:Europarl_Corpus
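
The abstract states that, after sentence splitting and tokenization, sentences were aligned across languages with an algorithm developed by Gale & Church (1993). The sketch below illustrates the general idea of such length-based alignment: sentences are grouped into aligned "beads" by dynamic programming, with a cost derived from how well the character lengths of the two sides match. It is a simplified illustration, not the pipeline actually used to build Europarl. The constants (`PRIORS`, `C`, `S2`) are values commonly quoted from the 1993 paper and should be treated as assumptions, and the names `match_cost` and `align` are hypothetical.

```python
import math

# Prior probability of each bead shape (source sentences, target sentences).
# Assumed values, commonly quoted from Gale & Church (1993).
PRIORS = {
    (1, 1): 0.89,
    (1, 0): 0.0099, (0, 1): 0.0099,
    (2, 1): 0.089, (1, 2): 0.089,
    (2, 2): 0.011,
}
C = 1.0   # assumed expected number of target characters per source character
S2 = 6.8  # assumed variance of that ratio


def match_cost(len_src, len_tgt, prior):
    """Negative log-probability of one bead, given total character lengths."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-sided tail probability of a standard normal variable.
    tail = max(math.erfc(abs(delta) / math.sqrt(2.0)), 1e-300)
    return -math.log(tail) - math.log(prior)


def align(src, tgt):
    """Return beads as (source_indices, target_indices) with minimal total cost."""
    n, m = len(src), len(tgt)
    src_len = [len(s) for s in src]
    tgt_len = [len(t) for t in tgt]
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward dynamic programme over prefixes of both sentence lists.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = match_cost(sum(src_len[i:ni]), sum(tgt_len[j:nj]), prior)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the beads in order.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


if __name__ == "__main__":
    english = ["The session is open.",
               "I declare the sitting closed and thank you all."]
    french = ["La séance est ouverte.",
              "Je déclare la séance levée et je vous remercie tous."]
    for src_idx, tgt_idx in align(english, french):
        print(src_idx, tgt_idx)
```

Because the cost depends only on character lengths, this kind of aligner needs no dictionary, which is one reason length-based methods suit large multilingual collections of translated proceedings. The example sentences are invented for illustration and are not taken from the corpus.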