Text segmentation is the process of dividing written text into words or other similar meaningful units, such as sentences or topics. The term applies to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing.
| Property | Value |
| p:abstract
| - Text segmentation is the process of dividing written text into words or other similar meaningful units, such as sentences or topics. The term applies to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing.
The problem may appear relatively trivial for written languages that have explicit word boundary markers, such as the word spaces of written English or the distinctive initial, medial and final letter shapes of Arabic. When such clues are not consistently available, the task often requires non-trivial techniques such as statistical decision-making, large dictionaries, and consideration of syntactic and semantic constraints.
In natural language processing (NLP), text segmentation techniques involve determining the boundaries between words and sentences. This process is not as simple as finding periods (a period may appear, for example, in a dollar amount), semicolons (a semicolon may appear, for example, in an XML entity reference such as `&amp;`), etc.
When processing plain text, tables of abbreviations that contain periods ("Mr.", for example) can help prevent incorrect assignment of sentence boundaries. Some text segmentation systems take advantage of markup such as HTML and of known document formats such as PDF to provide additional evidence for sentence and paragraph boundaries.
A document may contain multiple topics, and the task of computerized text segmentation may be to discover these topics automatically and segment the text accordingly.
The topic boundaries may be apparent from section titles and paragraphs.
In other cases one needs to use techniques similar to those used in document classification.
Many different approaches have been tried.
Effective natural language processing systems and text segmentation tools usually operate on text from specific domains and sources. For example, processing text from medical records is a very different problem from processing news articles or real estate advertisements.
The process of writing text segmentation tools starts with collecting a large corpus of text in an application domain. There are two general approaches:
* Manual analysis of text and writing custom software
* Annotating the sample corpus with boundary information and applying machine learning (en)
- In computational linguistics, morphological analysis is a procedure that determines the morphological, syntactic and, where applicable, semantic properties of words. Specifically, morphological analysis procedures can solve the following subtasks:
# Segmentation, i.e. splitting complex words into free and bound morphemes. The latter include prefixes and suffixes.
# Lemmatization: reducing a simple or complex word to its lemma and determining its syntactic properties. Example: the word "Häusern" is reduced to its lemma "Haus" with the properties {noun, plural, dative}.
# Determining the word structure; this is often done in conjunction with a word-semantic analysis. (de)
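- Morphological analysis (形態素解析, keitaiso kaiseki) is one of the foundational techniques of natural language processing performed with computers. It is also applied to tasks such as kana-kanji conversion.
Using knowledge of the target language's grammar (a collection of grammar rules) and a dictionary (a word list annotated with information such as parts of speech) as information sources, it refers to the task of splitting a sentence written in natural language into a sequence of morphemes (roughly, the smallest units of a language that carry meaning) and determining the part of speech of each.
The following is an example of a morphological analysis of the sentence 「お待ちしております」 (the morphological analysis tool ChaSen (茶筌) was used). (ja)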
|
| rdfs:comment
| - Text segmentation is the process of dividing written text into words or other similar meaningful units, such as sentences or topics. The term applies to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing. (en)
- In computational linguistics, morphological analysis is a procedure that determines the morphological, syntactic and, where applicable, semantic properties of words. Specifically, morphological analysis procedures can solve the following subtasks: (de)
- Morphological analysis (形態素解析, keitaiso kaiseki) is one of the foundational techniques of natural language processing performed with computers. It is also applied to tasks such as kana-kanji conversion. (ja)
|
| rdfs:label
| - Text segmentation (en)
- Morphologische Analyse (Computerlinguistik) (de)
- 形態素解析 (ja)
|
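
The abstract above notes that a table of abbreviations containing periods can help avoid false sentence boundaries, and that a period inside a dollar amount is not a boundary. The following is a minimal sketch of that idea in Python, not a definitive implementation. The function name `split_sentences` and the small `ABBREVIATIONS` set are hypothetical and chosen only for illustration; a real system would use a much larger, domain-specific table.

```python
import re

# Hypothetical abbreviation table (an assumption for illustration only).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e."}


def split_sentences(text):
    """Split text at '.', '!' or '?' followed by whitespace,
    skipping periods that end a known abbreviation."""
    sentences = []
    start = 0
    # Requiring whitespace after the punctuation means the period
    # in an amount such as "$3.50" is never a candidate boundary.
    for match in re.finditer(r"[.!?](?=\s)", text):
        end = match.end()
        tokens = text[start:end].split()
        # The last token before the candidate boundary decides the case.
        if tokens and tokens[-1].lower() in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation
        sentences.append(text[start:end].strip())
        start = end
    # Whatever remains after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Mr. Smith paid $3.50 for coffee. He left at noon."))
```

A rule like this still fails when an abbreviation genuinely ends a sentence, which is one reason the abstract mentions statistical and machine-learning approaches as alternatives to hand-written rules.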