Natural Language Processing (NLP)
A field of computer science, connected to artificial intelligence and computational linguistics, that focuses on the interaction between computers and human language, and on a machine’s ability to understand, or mimic the understanding of, human language.
See:
Resources
- https://en.wikipedia.org/wiki/Natural_language_processing
- NLP is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from text data efficiently. With NLP and its components, one can organize massive chunks of text data, perform numerous automated tasks, and solve a wide range of problems such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.
- https://github.com/keon/awesome-nlp
- The most important NLP highlights of 2018
- NLP - Udemy ML
- https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/
- https://www.datascience.com/blog/introduction-to-natural-language-processing-lexical-units-learn-data-science-tutorials
- https://github.com/BotCube/awesome-bots
- http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
- Hidden Markov model
- Speech recognition
Feature engineering, text preprocessing
See Embeddings and text representations
- https://www.geeksforgeeks.org/feature-extraction-techniques-nlp/
- https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/
- http://nitin-panwar.github.io/Text-prepration-before-Sentiment-analysis/
- Removing numbers, URLs and links, stopwords
- Stemming words
- Suffix-dropping algorithms
- Lemmatisation algorithms
- N-gram analysis
- Removing punctuation
- Stripping whitespace
- Checking for impure characters
- http://thinknook.com/10-ways-to-improve-your-classification-algorithm-performance-2013-01-21/
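The cleaning steps listed above can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the stopword list, the suffix list and the helper names (`strip_suffix`, `preprocess`, `ngrams`) are assumptions made for the example, and "impure characters" is interpreted here as non-ASCII characters, which is only reasonable for English text.

```python
import re
import string

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "at", "for"}
# Suffixes for a naive suffix-dropping stemmer (an assumption, not Porter's rules).
SUFFIXES = ("ing", "ed", "es", "s")


def strip_suffix(word: str) -> str:
    """Drop the first matching suffix, keeping a stem of at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list[str]:
    """Lowercase, clean and tokenize a string, then stem each token."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs and links
    text = re.sub(r"\d+", " ", text)  # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
    tokens = text.split()  # split() also collapses repeated whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]  # remove stopwords
    return [strip_suffix(t) for t in tokens]


def ngrams(tokens: list[str], n: int = 2) -> list[tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


tokens = preprocess("Visit https://example.com: the 3 cats are   jumping!")
print(tokens)             # ['visit', 'cat', 'jump']
print(ngrams(tokens, 2))  # [('visit', 'cat'), ('cat', 'jump')]
```

A lemmatiser (for example the ones shipped with NLTK or spaCy) would replace `strip_suffix` when dictionary forms are needed rather than truncated stems.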
Semantics
Distributional semantics
- General recipe:
- form a word-context matrix of counts (data)
- perform dimensionality reduction (SVD) for generalization
- For LSA the context is the document where the word appears.
- For word2vec the context is just the nearby words (in some window) in a document.
- Latent semantic analysis
- Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
- LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text, and SVD is used to reduce the number of rows while preserving the similarity structure among columns. Words are then compared by taking the cosine of the angle between the vectors formed by any two rows (equivalently, the dot product of the normalized vectors). Values close to 1 represent very similar words while values close to 0 represent very dissimilar words.
- http://mccormickml.com/2016/03/25/lsa-for-text-classification-tutorial/
- https://github.com/chrisjmccormick/LSA_Classification/blob/master/runClassification_LSA.py
- http://stackoverflow.com/questions/30590881/python-lsa-with-sklearn
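The count-then-reduce recipe can be sketched for LSA with NumPy. This is a toy illustration under stated assumptions: the five "paragraphs", the choice of `k = 2` concepts and the helper names are invented for the example, and raw counts are used where real pipelines usually apply tf-idf weighting first.

```python
import numpy as np

# Hypothetical mini-corpus: each string plays the role of one paragraph.
paragraphs = [
    "cat dog pet",
    "dog cat animal pet",
    "cat pet",
    "stock market trade",
    "market stock price trade",
]
vocab = sorted({w for p in paragraphs for w in p.split()})
row = {w: i for i, w in enumerate(vocab)}

# Step 1: word-context count matrix (rows: unique words, columns: paragraphs).
X = np.zeros((len(vocab), len(paragraphs)))
for j, p in enumerate(paragraphs):
    for w in p.split():
        X[row[w], j] += 1

# Step 2: truncated SVD. Keeping the k largest singular values maps each
# word to a k-dimensional "concept" vector.
k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
word_vecs = U[:, :k] * S[:k]


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def similarity(w1, w2):
    """Compare two words through their rows in the reduced space."""
    return cosine(word_vecs[row[w1]], word_vecs[row[w2]])


print(similarity("cat", "dog"))    # close to 1: the words share paragraphs
print(similarity("cat", "stock"))  # close to 0: the words never co-occur
```

Replacing the paragraph columns with counts of neighbouring words inside a fixed window gives the word-context variant of the same recipe.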
Topic Modelling
- https://en.wikipedia.org/wiki/Topic_model
- Latent Dirichlet Allocation
- A common topic modeling technique, LDA is based on the premise that each document or piece of text is a mixture of a small number of topics and that each word in a document is attributable to one of the topics.
- http://engineering.flipboard.com/2017/02/storyclustering
- http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
- http://ahmedbesbes.com/how-to-mine-newsfeed-data-and-extract-interactive-insights-in-python.html
Neural semantic parsing
Explicit semantic analysis
- https://en.wikipedia.org/wiki/Explicit_semantic_analysis
- In NLP and information retrieval, explicit semantic analysis (ESA) is a vectorial representation of text (individual words or entire documents) that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text corpus and a document (string of words) is represented as the centroid of the vectors representing its words. Typically, the text corpus is English Wikipedia, though other corpora including the Open Directory Project have been used.
- Used in information retrieval, document classification and semantic relatedness calculation (i.e. how similar in meaning two words or pieces of text are to each other), ESA interprets the meaning of a piece of text as a combination of the concepts found in that text.
- Corpus (plural: corpora): a usually large collection of documents that can be used to infer and validate linguistic rules, as well as to do statistical analysis and hypothesis testing.
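The ESA representation described above can be sketched as follows. This is an illustration only: the three-article "knowledge base" stands in for Wikipedia, the helper names are hypothetical, and the tf-idf weighting shown (term frequency times `log(N / df)`) is one common variant among several.

```python
import math

import numpy as np

# Hypothetical knowledge base: concept name -> article text.
concepts = {
    "Cat": "cat feline pet whiskers purr pet",
    "Dog": "dog canine pet bark fetch",
    "Finance": "stock market bank money trade",
}
names = list(concepts)
docs = [text.split() for text in concepts.values()]
vocab = sorted({w for d in docs for w in d})
col = {w: j for j, w in enumerate(vocab)}

# tf-idf matrix: one row per concept article, one column per word.
tfidf = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for w in set(d):
        tf = d.count(w) / len(d)
        df = sum(w in other for other in docs)
        tfidf[i, col[w]] = tf * math.log(len(docs) / df)


def word_vector(word):
    """A word is a column of the tf-idf matrix: its weight in every concept."""
    return tfidf[:, col[word]]


def text_vector(text):
    """A text is the centroid of the vectors of its known words."""
    vecs = [word_vector(w) for w in text.lower().split() if w in col]
    return np.mean(vecs, axis=0) if vecs else np.zeros(len(docs))


def relatedness(a, b):
    """Cosine similarity between two texts in concept space."""
    va, vb = text_vector(a), text_vector(b)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom else 0.0


print(relatedness("cat purr", "feline whiskers"))  # high: same concept
print(relatedness("cat purr", "stock market"))     # low: disjoint concepts
```

Because every dimension is a named concept, `names[int(np.argmax(text_vector("bark fetch")))]` reads off the best-matching article directly, which is what makes the representation "explicit".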
Sentiment analysis
- https://en.wikipedia.org/wiki/Sentiment_analysis
- The use of NLP techniques to extract subjective information from a piece of text, e.g. whether an author is being subjective or objective, or positive or negative.
- http://varianceexplained.org/r/trump-tweets/
- http://blog.aylien.com/sentiment-analysis-of-2-2-million-tweets-from-super-bowl-51/
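As a rough illustration of polarity detection, the simplest approach sums word scores from a lexicon. The lexicon, the negation list and the `polarity` helper below are all hypothetical; practical systems rely on curated lexicons or trained classifiers.

```python
import re

# Toy polarity lexicon (hypothetical values).
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}
NEGATIONS = {"not", "never", "no"}


def polarity(text: str) -> int:
    """Sum lexicon scores; a negation flips the token that immediately follows it."""
    score, negate = 0, False
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in NEGATIONS:
            negate = True
            continue
        value = LEXICON.get(token, 0)
        score += -value if negate else value
        negate = False  # the flip applies to one token only
    return score


print(polarity("I love this great movie"))  # 4
print(polarity("not good"))                 # -1
```

A positive score suggests positive sentiment, a negative score negative sentiment, and zero either neutral or unknown vocabulary.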
Deep learning-based
- Modern Deep Learning Techniques Applied to Natural Language Processing
- https://github.com/brianspiering/awesome-dl4nlp
- Deep Learning in NLP
- https://softwaremill.com/deep-learning-for-nlp/
- http://blog.aylien.com/modeling-documents-generative-adversarial-networks/
- New AI classifier for indicating AI-written text
CNN-based
- Convolutional Neural Network for Sentence Classification
- http://www.kdnuggets.com/2017/05/deep-learning-extract-knowledge-job-descriptions.html
- How to read: Character level deep learning
- #PAPER Connectionist Temporal Classification (Hannun 2017)
RNN-based
- RNN for NLP
- http://www.abigailsee.com/2017/04/16/taming-rnns-for-better-summarization.html
- #PAPER RRA: Recurrent Residual Attention for Sequence Learning (Wang 2017)
Seq2seq
- #PAPER Sequence to Sequence Learning with Neural Networks (Sutskever 2014)
- Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine translation, text summarization, and image captioning. Google Translate started using such a model in production in late 2016. These models are explained in the two pioneering papers (Sutskever et al., 2014, Cho et al., 2014).
- A sequence-to-sequence model takes a sequence of items (words, letters, features of an image, etc.) and outputs another sequence of items.
- Under the hood, the model is composed of an encoder and a decoder. The encoder processes each item in the input sequence and compiles the information it captures into a vector (called the context). After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item.
- The context is a vector (an array of numbers, basically) in the case of machine translation. The encoder and decoder tend to both be recurrent neural networks.
- https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
- Encoder-Decoder LSTMs for sequence-to-sequence prediction
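The encoder-decoder data flow described above can be sketched with a vanilla RNN in NumPy. Everything here is an assumption for illustration: the toy sizes, the token ids, the shared embedding table and the greedy decoding loop. The weights are random and untrained, so the output tokens are meaningless; only the flow from input sequence to context vector to output sequence is of interest.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 6, 8  # toy sizes (assumptions)
BOS, EOS = 0, 1                 # hypothetical start and end token ids

# Randomly initialised, untrained parameters.
E = rng.normal(0, 0.1, (vocab_size, hidden_size))           # token embeddings
W_enc = rng.normal(0, 0.1, (2 * hidden_size, hidden_size))  # encoder weights
W_dec = rng.normal(0, 0.1, (2 * hidden_size, hidden_size))  # decoder weights
W_out = rng.normal(0, 0.1, (hidden_size, vocab_size))       # output projection


def rnn_step(W, x, h):
    """One vanilla RNN step: combine the current input with the previous state."""
    return np.tanh(np.concatenate([x, h]) @ W)


def encode(src_ids):
    """Read the input item by item; the final hidden state is the context."""
    h = np.zeros(hidden_size)
    for tok in src_ids:
        h = rnn_step(W_enc, E[tok], h)
    return h


def decode(context, max_len=5):
    """Start from the context and emit output tokens one at a time."""
    h, tok, out = context, BOS, []
    for _ in range(max_len):
        h = rnn_step(W_dec, E[tok], h)
        tok = int(np.argmax(h @ W_out))  # greedy choice of the next token
        if tok == EOS:
            break
        out.append(tok)
    return out


context = encode([2, 3, 4])   # the whole input is squeezed into one vector
output_ids = decode(context)  # the decoder sees only that vector
print(context.shape, output_ids)
```

The single fixed-size context is the bottleneck that attention mechanisms were later introduced to relieve.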
Google Neural Machine Translation
- https://en.wikipedia.org/wiki/Google_Neural_Machine_Translation
- #PAPER Zero-shot translation
- Google Neural Machine Translation (GNMT) is a neural machine translation (NMT) system developed by Google and introduced in November 2016, that uses an artificial neural network to increase fluency and accuracy in Google Translate.
- https://research.googleblog.com/2016/11/zero-shot-translation-with-googles.html
Transformer-based
See AI/Deep Learning/Transformers#For NLP
Books
- #BOOK Natural Language Processing with Python (Bird, 2013 OREILLY)
- #BOOK Mastering NLP with Python (Chopra, 2016 PACKT)
- #BOOK An Introduction to Information Retrieval (Manning 2009, CAMBRIDGE)
- #BOOK Text mining with R (Silge, 2020 OREILLY)
Courses
- #COURSE Neural networks for NLP (Carnegie Mellon)
- #COURSE NLP (Stanford 15)
- #COURSE NLP with Deep Learning (Stanford 16,17)
- #COURSE Natural Language Understanding (Stanford 16)
- #COURSE Deep Learning for NLP (Oxford/Deepmind 17)
- #COURSE YSDA Natural Language Processing course (Yandex)
Talks
- #TALK Introduction to Natural Language Processing - Cambridge Data Science Bootcamp
- #TALK Rob Romijnders | Using deep learning in natural language processing (PyData)
- #TALK Jeff Abrahamson - WTF am I doing? An introduction to NLP and ANN's
- #TALK Natural Language Processing with PySpark
- #TALK Feeding Word2vec with tens of billions of items, what could possibly go wrong? (Simon Dollé)
- #TALK Deep Learning for Natural Language Processing (2015)
Code
- #CODE Arabica - A Python package for exploratory analysis of text data
- #CODE Rubrix - Open-source framework for data-centric NLP. Data annotation and monitoring for enterprise NLP
- #CODE Beir - Heterogeneous Benchmark for Information Retrieval
- #CODE FastText - Library for efficient text classification and representation learning
- #CODE Fairseq - Facebook AI Research Sequence-to-Sequence Toolkit written in Python
- Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks
- #CODE OpenNMT-tf - OpenNMT-tf is a general purpose sequence learning toolkit using TensorFlow 2
- #CODE OpenNLP - The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text
- #CODE Textgen RNN
- #CODE Stanford CoreNLP
- #CODE NLTK - NLTK is a leading platform for building Python programs to work with human language data
- #CODE Textblob - TextBlob is a Python library for processing textual data
- #CODE Spacy - Industrial-strength NLP
- #TALK Matthew Honnibal - Designing spaCy: Industrial-strength NLP
- #TALK Patrick Harrison | Modern NLP in Python (SpaCy and gensim for recommendation-reviews analysis)
- https://spacy.io/docs/usage/tutorials
- https://nicschrading.com/project/Intro-to-NLP-with-spaCy/
- http://blog.thedataincubator.com/2016/04/nltk-vs-spacy-natural-language-processing-in-python/
- #CODE ParlAI - A unified platform for sharing, training and evaluating dialogue models across many tasks
- #CODE Gensim - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora
- #CODE Spark-NLP - State-of-the-art Natural Language Processing library built on top of Apache Spark
OCR
- #CODE Pytesseract - Python-tesseract is an optical character recognition (OCR) tool for Python. That is, it will recognize and "read" the text embedded in images.
- #CODE Tesseract
- #CODE Docling - Get your documents ready for gen AI
Speech
See GenAI for audio
- #CODE PaddleSpeech - toolkit for tasks in speech and audio, with state-of-the-art and influential models
Web scraping and cleaning
- #CODE Requests (For fetching HTML/XML from web pages)
- #CODE BeautifulSoup (web scraping data parsing)
- #CODE LXML (web scraping data parsing)
- #CODE Dryscrape (web scraping with javascript)
- #CODE Selenium (web scraping with javascript)
- #CODE Scrapy (web scraping framework)
- https://doc.scrapy.org/en/latest/intro/overview.html
- Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
- https://medium.com/@kaismh/extracting-data-from-websites-using-scrapy-e1e1e357651a#.j9hrs2scn
- #CODE python-ftfy: fixes text for you
- #CODE Arrow - working with dates and times
- #CODE Beautifier - clean and prettify URLs and email addresses