Table of Contents

- Introduction
- Text Preprocessing
  - Character-level tokenization
  - Word-level tokenization
  - Subword tokenization
  - Stopwords
  - Batching
  - Padding
  - Unsupervised Pre-Training
    - Autoregression
    - BERT loss
- Tasks
  - Text Classification
  - Named Entity Recognition
  - Question Answering
  - Summarization
  - Translation
  - Text Generation
- Models
- Perplexity

Text Preprocessing

Text preprocessing is an essential step in NLP that involves cleaning and transforming unstructured text data to prepare it for analysis. Some common text preprocessing techniques include:
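The techniques listed above can be combined into a single pipeline. Below is a minimal sketch using only the standard library; the contraction map and stopword list are tiny illustrative assumptions (real pipelines use fuller resources such as NLTK's stopword corpus), and `preprocess` is a hypothetical helper name:

```python
import re
import string

# Illustrative assumptions: real systems use much larger resources.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "won't": "will not"}
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "do", "not"}

def preprocess(text: str) -> list[str]:
    """Lowercase, expand contractions, strip punctuation and digit tokens, drop stopwords."""
    text = text.lower()
    # Expand contractions before punctuation removal, while the apostrophe is still there.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Remove punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    # Drop tokens that contain digits, then drop stopwords.
    tokens = [t for t in tokens if not re.search(r"\d", t)]
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Don't remove the 2nd apple!"))
```

Note the ordering: contraction expansion must precede punctuation removal, since stripping the apostrophe first would leave "dont", which the map no longer matches.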
- Expanding contractions (e.g., "don't" to "do not") [7]
- Lowercasing text [7]
- Removing punctuation [7]
- Removing digits and words containing digits [7]
- Removing stopwords (common words that do not carry much meaning) [7]
- Rephrasing text [7]
- Stemming and lemmatization (reducing words to their root forms) [7]

Common Tokenizers

Tokenization is the process of breaking a stream of textual data into words, terms, sentences, symbols, or other meaningful elements called tokens. Some common tokenizers used in NLP include:
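The three tokenization granularities named in the table of contents (character-level, word-level, subword) can be sketched in a few lines. The character and word tokenizers below are self-contained; the subword tokenizer is a greedy longest-match-first sketch in the style of WordPiece, where the toy vocabulary and the "##" continuation prefix are illustrative assumptions rather than a real trained vocabulary:

```python
import re

def char_tokenize(text: str) -> list[str]:
    # Character-level: every character is a token; tiny vocabulary, long sequences.
    return list(text)

def word_tokenize(text: str) -> list[str]:
    # Word-level: split on word boundaries, keeping punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    # Subword (WordPiece-style sketch): greedily match the longest vocabulary
    # piece from the left; continuation pieces are prefixed with "##".
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j] if i == 0 else "##" + word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            return ["[UNK]"]  # no piece matched: the word is out of vocabulary
    return tokens

# Toy vocabulary for illustration only.
vocab = {"un", "##break", "##able"}
print(subword_tokenize("unbreakable", vocab))
```

Subword tokenization is the middle ground used by modern models: frequent words stay whole, while rare words decompose into pieces, avoiding both the huge vocabularies of word-level tokenization and the long sequences of character-level tokenization.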