
Basic Text Processing

rameshjesswani edited this page Jun 16, 2017 · 16 revisions

Regular Expressions

  • A regular expression is a formal language for specifying text strings.
  • It is useful for searching for words in text.
  • It is a sequence of characters that specifies a pattern, used to extract particular words from text.
    • Example: to find "the" or "The" in a text, this pattern can be used: [^a-zA-Z][tT]he[^a-zA-Z]
  • In NLP, regular expressions are useful for many tasks, such as information extraction, speech recognition, etc.
  • Regular expressions are used in NLP:
    • To validate data fields, e.g. emails, passwords, URLs.
    • To filter text, e.g. spam detection, or filtering specific URLs.
    • To identify particular strings in text.
  • For many difficult tasks, machine learning classifiers are used:
    • There, regular expressions serve as features for the classifiers.
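The pattern from the example above can be tried directly with Python's `re` module. This is a minimal sketch; the anchors `^`/`$` are added so that "the" at the very start or end of the string is also matched, and a lookahead is used for the trailing boundary so adjacent matches do not overlap:

```python
import re

text = "The other day the theme of the lecture was theology."

# "the" or "The" bounded by non-letter characters (or the string edges),
# so words like "theme" and "theology" are excluded.
pattern = r"(?:^|[^a-zA-Z])([tT]he)(?=[^a-zA-Z]|$)"

matches = re.findall(pattern, text)
print(matches)  # ['The', 'the', 'the']
```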

Text Normalization

  • Before any natural language processing of a text, the text has to be normalized.
  • At least three steps are applied to normalize text:
    • Segmenting/tokenizing words from running text
    • Normalizing word formats
    • Segmenting sentences in running text

i. Word Tokenization

  • It is the process of chopping the given text document into words.
  • Tokens from the text do not contain whitespace characters, and may or may not include punctuation characters.
  • How many words are in a text?
    • Example: "I wan- want to go ah university" — how do we decide which words to count, given that fragments like "wan-" and fillers like "ah" occur in natural speech?
    • This is especially tricky for speech processing.
  • Words can be counted by uniqueness as well as by total number (regardless of whether the same word occurs more than once):
    • Type: an element of the vocabulary
    • Token: an instance of that type in running text
    • Example: "They went to San Francisco and enjoyed their vacations in the winter"
    • Tokens: tokenization counts every instance, and treats "San Francisco" as two words.
    • Types: "the" is counted once; "they" and "their" can be counted as one type because they share the same lemma; "San Francisco" is still counted as two words.
  • Issues in tokenization:
    • "Germany's capital": Germany, Germanys, or Germany's?
    • What're, I'm, isn't: What are, I am, is not
    • "San Francisco": one word or two?
    • How these cases are tokenized needs to be defined by a standard.
  • Tokenization is also an issue in languages such as Chinese and Japanese, where words are not separated by spaces.
  • However, there are algorithms that address this for Chinese, such as maximum matching (also called the greedy algorithm), which helps to segment the words.
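The maximum-matching idea mentioned above can be sketched in a few lines of Python: at each position, take the longest dictionary word that matches, falling back to a single character. The dictionary here is a toy example; real segmenters use large lexicons, and the greedy strategy can still pick a wrong longest match:

```python
def max_match(text, dictionary):
    """Greedy (maximum-matching) word segmentation for unspaced text."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest possible span first; a single character is the fallback.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Hypothetical toy dictionary for illustration only.
dictionary = {"he", "can", "canoe", "a", "no"}
print(max_match("hecanoe", dictionary))  # ['he', 'canoe']
```

Note that the algorithm is greedy: it commits to the longest match even when a shorter one would lead to a better overall segmentation, which is one reason it works much better for Chinese (short words) than for English.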

ii. Normalizing word formats

  • Normalization is the process of transforming a list of words/tokens into a standard format.
  • This process makes other operations easier to perform.
  • Operations that can be performed during normalization include:
    • Convert characters to lowercase: this can simplify searching, but there are exceptions to converting all characters to lowercase. For example, "USA" (for "United States of America") should not be written as "usa".
    • Expand abbreviations
    • Remove stopwords: most stopwords, such as "the", "a", "of", "in", "on", carry little meaning. Removing them saves space in an index and makes indexing faster. On the other hand, some search engines do not remove stopwords, because they can be useful for certain queries. For example, when an exact match between two pieces of text is required, removing stopwords can cause problems.
    • Lemmatization and stemming:
      • Lemmatization is the task of determining that two words have the same root, despite their surface differences.
      • Example: words like "am", "are", and "is" have the same lemma "be", while "car" and "cars" share the lemma "car".
      • Example: the lemmatized form of the sentence "They are reading detective stories" is "They be read detective story".
      • Process of lemmatization (how is it done?):
        • Lemmatization involves complete morphological parsing of the word.
        • Morphology is the study of how words are built up from smaller meaning-bearing units known as morphemes.
        • Morphemes are divided into two classes:
          • Stems: the central morpheme of the word, which supplies its main meaning.
          • Affixes: morphemes that add additional meanings of various kinds.
          • Example: a morphological parser takes a word like "cars" and parses it into the morpheme "car" (a stem) and the morpheme "-s" (an affix).
        • To deal with morphological variation in word forms, a stemming algorithm such as the Porter stemmer is often used.
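The idea behind suffix-stripping stemmers like the one mentioned above can be sketched with a heavily simplified rule set. This is not the Porter algorithm (which applies several ordered rule phases with conditions on the stem), just an illustration of why a stemmer's output need not be a dictionary word:

```python
# Ordered list of suffixes to strip; longest rules come first.
SUFFIXES = ["ing", "ies", "ed", "es", "s"]

def simple_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([simple_stem(w) for w in ["cars", "parsing", "parsed", "car"]])
# ['car', 'pars', 'pars', 'car']
```

Note that "parsing" and "parsed" both reduce to "pars", which is not a dictionary word; a lemmatizer would instead return "parse".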

iii. Sentence Segmentation or Sentence Boundary Detection

  • It is the process of dividing text into sentences.
  • The most useful cues for dividing text into sentences are punctuation marks such as periods, question marks, and exclamation points.
  • However, a period can also denote an abbreviation, a decimal point, or part of an email address rather than the end of a sentence.
  • Sentence tokenization tools use one of the following methods:
    • Machine learning (unsupervised and supervised)
    • Rule-based
  • Using these methods, it can be determined whether a period is part of a word or is a sentence-end marker.
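A minimal rule-based splitter can illustrate the abbreviation problem described above. The abbreviation list here is a tiny hypothetical one; real tools curate or learn much larger lists and also handle numbers and email addresses:

```python
import re

# Hypothetical abbreviation list for illustration only.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc"}

def split_sentences(text):
    """Rule-based splitting: '.', '?' or '!' ends a sentence unless the
    preceding token is a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.?!]", text):
        tokens = text[start:m.start()].split()
        prev = tokens[-1].lower() if tokens else ""
        if m.group() == "." and prev in ABBREVIATIONS:
            continue  # period belongs to an abbreviation, not a sentence end
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. Was he late?"))
# ['Dr. Smith arrived.', 'Was he late?']
```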

Summary

How Text Normalization can help in Semantic Similarity

  • Firstly, sentence segmentation is performed on the text.

  • Secondly, word tokenization is performed on the segmented sentences.

  • Thirdly, text normalization (such as lemmatization, stemming, stopword removal, lowercasing) is performed on the tokenized words.

    • How the text is normalized depends on the application. For example, if one wants to compare keywords (words carrying meaning), then stopwords can be removed.
    • Lemmatizers and stemmers are used to convert inflected words into the same form.
    • Stemmers use algorithms to remove prefixes and suffixes, and the result may not be a dictionary word.
    • Lemmatizers use a corpus, and the result is always a dictionary word. Moreover, lemmatizers need extra information about the part of speech of the word: "calling", for example, can be either a verb or a noun (the calling).
    • When to use a lemmatizer and when to use a stemmer:
      • It depends on the application. If speed is important, stemmers should be used, since lemmatizers have to search through a corpus while stemmers perform simple string operations.
      • If dictionary words must be returned, lemmatizers should be used.
  • A fourth step can be to compute similarity based on strings.

  • What we already have in the Natural Language Toolkit (NLTK)

    • From an implementation point of view, NLTK already provides the Porter stemmer algorithm for stemming and a WordNet-based lemmatizer.
    • NLTK also has functions for word tokenization and sentence tokenization, and a list of English stopwords.
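The summarized pipeline (sentence segmentation, word tokenization, lowercasing, stopword removal) can be sketched with the standard library alone. The regex-based splitting and the tiny stopword list are simplifications; in practice NLTK's tokenizers and its full English stopword list would replace them:

```python
import re

# Tiny illustrative stopword list; NLTK ships a full English list.
STOPWORDS = {"the", "a", "of", "in", "on", "and", "to", "is", "are"}

def normalize(text):
    """Sentence split -> word tokenize -> lowercase -> drop stopwords."""
    sentences = re.split(r"(?<=[.?!])\s+", text)
    normalized = []
    for sentence in sentences:
        tokens = re.findall(r"[A-Za-z']+", sentence.lower())
        normalized.append([t for t in tokens if t not in STOPWORDS])
    return normalized

print(normalize("They went to San Francisco. The winter was cold."))
# [['they', 'went', 'san', 'francisco'], ['winter', 'was', 'cold']]
```

The output of such a pipeline (one list of normalized tokens per sentence) is a convenient input for the string-based similarity measures discussed next.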

For the next task, I will compute distances between words, i.e. how similar two strings are, using string-based similarity approaches.

