Awesome NLP with Ruby
Useful resources for text processing in Ruby
This curated list comprises awesome resources, libraries, and information sources on the computational processing of texts in human languages with the Ruby programming language. The field is often referred to as NLP, Computational Linguistics, or HLT (Human Language Technology), and is closely related to Artificial Intelligence, Machine Learning, Information Retrieval, Text Mining, Knowledge Extraction and other disciplines.
This list comes from our day-to-day work on Language Models and NLP Tools. Read why this list is awesome. Our FAQ explains the important decisions behind the list and answers common questions.
Our main goal is to promote Ruby as a tool for NLP-related tasks. Your help, suggestions and contributions are welcome! We kindly ask you to study the Contribution section. Follow us on Twitter and please spread the word using the #RubyNLP hashtag!
NLP Pipeline Subtasks
An NLP Pipeline starts with plain text.
Pipeline Generation
- composable_operations - Definition framework for operation pipelines.
- ruby-spark - Spark bindings with an easy to understand DSL.
- phobos - Simplified Ruby Client for Apache Kafka.
Multipurpose Engines
- open-nlp - Ruby Bindings for the OpenNLP Toolkit.
- stanford-core-nlp - Ruby Bindings for the Stanford CoreNLP tools.
- treat - Natural Language Processing framework for Ruby (like NLTK for Python).
- nlp_toolz - Wrapper over some OpenNLP classes and the original Berkeley Parser.
- open_nlp - JRuby Bindings for the OpenNLP Toolkit.
On-line APIs
- alchemyapi_ruby - Legacy Ruby SDK for AlchemyAPI/Bluemix.
- wit-ruby - Ruby client library for the Wit.ai Language Understanding Platform.
- wlapi - Ruby client library for Wortschatz Leipzig web services.
- monkeylearn-ruby - Sentiment Analysis, Topic Modelling, Language Detection, Named Entity Recognition via a Ruby based Web API client.
Language Identification
Language Identification is one of the first crucial steps in every NLP Pipeline.
- scylla - Language Categorization and Identification.
Segmentation
Tools for Tokenization, Word and Sentence Boundary Detection and Disambiguation.
- tokenizer - Simple multilingual tokenizer. [tutorial]
- pragmatic_tokenizer - Multilingual tokenizer to split a string into tokens.
- nlp-pure - Natural language processing algorithms implemented in pure Ruby with minimal dependencies.
- textoken - Simple and customizable text tokenization library.
- pragmatic_segmenter - Word Boundary Disambiguation with many cookies.
- punkt-segmenter - Pure Ruby implementation of the Punkt Segmenter.
- tactful_tokenizer - RegExp based tokenizer for different languages.
- scapel - Sentence Boundary Disambiguation tool.
Lexical Processing
Stemming
Stemming is the term used in information retrieval to describe the process of
reducing word forms to some base representation. Stemming should be distinguished
from Lemmatization, since stems do not necessarily have a linguistic motivation.
- ruby-stemmer - Ruby-Stemmer exposes the SnowBall API to Ruby.
- uea-stemmer - Conservative stemmer for search and indexing.
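To illustrate why stems need not be linguistically motivated, here is a minimal sketch of a suffix-stripping stemmer in pure Ruby. It is not the Snowball or UEA-Lite algorithm used by the gems above; the suffix list, the minimum stem length of three characters, and the names `NAIVE_SUFFIXES` and `naive_stem` are assumptions made only for this example.

```ruby
# Hypothetical, deliberately naive suffix list (longest suffixes first).
NAIVE_SUFFIXES = %w[ingly edly ing ed ies es s].freeze

# Strips the first matching suffix, provided at least three characters remain.
# Illustration only: real stemmers apply far more careful rules.
def naive_stem(word)
  token = word.downcase
  suffix = NAIVE_SUFFIXES.find do |s|
    token.end_with?(s) && token.length - s.length >= 3
  end
  suffix ? token[0...-suffix.length] : token
end

%w[running ponies cats Walked is].each do |word|
  puts "#{word} -> #{naive_stem(word)}"
end
```

Results such as "runn" and "pon" are usable as index keys but are not dictionary words, which is exactly where a lemmatizer differs.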
Lemmatization
Lemmatization is considered a process of finding a base form of a word. Lemmas are often collected in dictionaries.
- lemmatizer - WordNet based Lemmatizer for English texts.
Lexical Statistics: Counting Types and Tokens
- wc - Facilities to count word occurrences in a text.
- word_count - Word counter for String and Hash objects.
- words_counted - Pure Ruby library counting word statistics with different custom options.
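The difference between tokens (running words) and types (distinct word forms) can be sketched with the standard library alone. This is a rough illustration, not the API of any gem listed above; the method name `type_token_stats` and the choice to lowercase and keep only alphabetic runs are assumptions.

```ruby
# Counts tokens and types in a text.
# Assumption: a token is any run of Unicode letters, compared case-insensitively.
def type_token_stats(text)
  tokens = text.downcase.scan(/\p{Alpha}+/)
  counts = tokens.each_with_object(Hash.new(0)) { |token, acc| acc[token] += 1 }
  { tokens: tokens.size, types: counts.size, counts: counts }
end

stats = type_token_stats("The cat saw the other cat.")
puts "tokens: #{stats[:tokens]}, types: #{stats[:types]}"
```

The libraries above offer more control, for example over what counts as a word.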
Filtering Stop Words
- stopwords-filter - Filter and Stop Word Lexicon based on the SnowBall lemmatizer.
Phrasal Level Processing
- n_gram - N-Gram generator.
- ruby-ngram - Break words and phrases into ngrams.
- raingrams - Flexible and general-purpose ngrams library written in pure Ruby.
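As a minimal sketch of what an n-gram generator does, the following uses only `Enumerable#each_cons` from the Ruby core. The method names `word_ngrams` and `char_ngrams` are hypothetical, and splitting on whitespace is a simplifying assumption; the gems above handle tokenization and padding in their own ways.

```ruby
# Word-level n-grams: every run of n consecutive whitespace-separated words.
def word_ngrams(text, n)
  text.split.each_cons(n).map { |gram| gram.join(" ") }
end

# Character-level n-grams: every run of n consecutive characters of one word.
def char_ngrams(word, n)
  word.chars.each_cons(n).map(&:join)
end

p word_ngrams("to be or not", 2)
p char_ngrams("ruby", 2)
```

An input shorter than n simply yields an empty array.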
Syntactic Processing
Constituency Parsing
- stanfordparser - Ruby based wrapper for the Stanford Parser.
Semantic Analysis
- amatch - Set of five distance types between strings (including Levenshtein, Sellers, Jaro-Winkler, 'pair distance').
- damerau-levenshtein - Calculates edit distance using the Damerau-Levenshtein algorithm.
- hotwater - Fast Ruby FFI string edit distance algorithms.
- levenshtein-ffi - Fast string edit distance computation, using the Damerau-Levenshtein algorithm.
- tf_idf - Term Frequency / Inverse Document Frequency in pure Ruby.
- tf-idf-similarity - Calculate the similarity between texts using TF/IDF.
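For readers unfamiliar with edit distance, here is a small pure-Ruby sketch of the plain Levenshtein distance (insertions, deletions, substitutions). It is not the implementation of any gem above, and unlike the Damerau-Levenshtein variant it does not treat a transposition as a single edit. The method name `levenshtein` is chosen only for this example.

```ruby
# Plain Levenshtein distance using two rows of the dynamic programming table.
def levenshtein(a, b)
  # previous[j] is the distance between the first i-1 chars of a and the first j chars of b.
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,        # deletion
        current[j - 1] + 1,     # insertion
        previous[j - 1] + cost  # substitution or match
      ].min
    end
    previous = current
  end
  previous.last
end

puts levenshtein("kitten", "sitting")
```

The FFI-based gems in this section are the better choice when speed matters.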
Pragmatical Analysis
- SentimentLib - Simple extensible sentiment analysis gem.
High Level Tasks
Spelling and Error Correction
- gingerice - Spelling and Grammar corrections via the Ginger API.
- hunspell-i18n - Ruby bindings to the standard Hunspell Spell Checker.
- ffi-hunspell - FFI based Ruby bindings for Hunspell.
- hunspell - Ruby bindings to Hunspell via Ruby C API.
Text Alignment
- alignment - Alignment routines for bilingual texts (Gale-Church implementation).
Machine Translation
- google-api-client - Google API Ruby Client.
- microsoft_translator - Ruby client for the Microsoft Translator API.
- termit - Google Translate with speech synthesis in your terminal.
- zipf - Implementation of BLEU and other base algorithms.
Dialog Systems
- chatterbot - Straightforward ruby-based Twitter Bot Framework, using OAuth to authenticate.
- lita - Chat operation bot framework with persistent storage provided by Redis.
Sentiment Analysis
Numbers, Dates, and Time Parsing
- chronic - Pure Ruby natural language date parser.
- chronic_between - Simple Ruby natural language parser for date and time ranges.
- chronic_duration - Pure Ruby parser for elapsed time.
- kronic - Methods for parsing and formatting human readable dates.
- nickel - Extracts date, time, and message information from naturally worded text.
- tickle - Parser for recurring and repeating events.
- numerizer - Ruby parser for English number expressions.
Named Entity Recognition
- ruby-ner - Named Entity Recognition with Stanford NER and Ruby.
- ruby-nlp - Ruby Binding for the Stanford POS Tagger and Named Entity Recognizer.
Text-to-Speech-to-Text
- espeak-ruby - Small Ruby API for utilizing 'espeak' and 'lame' to create text-to-speech mp3 files.
- tts - Text-to-Speech conversion using the Google translate service.
- att_speech - Ruby wrapper over the AT&T Speech API for speech to text.
- pocketsphinx-ruby - Pocketsphinx bindings.
Linguistic Resources
- rwordnet - Pure Ruby self contained API library for the Princeton WordNet®.
- wordnet - Performance tuned bindings for the Princeton WordNet®.
Machine Learning Libraries
Machine Learning Algorithms in pure Ruby or written in other programming languages with appropriate bindings for Ruby.
For a more up-to-date list, please see the Awesome ML with Ruby list.
- rb-libsvm - Support Vector Machines with Ruby.
- weka-jruby - JRuby bindings for Weka, different ML algorithms implemented through Weka.
- decisiontree - Decision Tree ID3 Algorithm in pure Ruby [post].
- rtimbl - Memory based learners from the Timbl framework.
- classifier-reborn - General classifier module to allow Bayesian and other types of classifications.
- lda-ruby - Ruby implementation of the LDA (Latent Dirichlet Allocation) for automatic Topic Modelling and Document Clustering.
- liblinear-ruby-swig - Ruby interface to LIBLINEAR (much more efficient than LIBSVM for text classification).
- linnaeus - Redis-backed Bayesian classifier.
- maxent_string_classifier - JRuby maximum entropy classifier for string data, based on the OpenNLP Maxent framework.
- naive_bayes - Simple Naive Bayes classifier.
- nbayes - Full-featured, Ruby implementation of Naive Bayes.
- omnicat - Generalized rack framework for text classifications.
- omnicat-bayes - Naive Bayes text classification implementation as an OmniCat classifier strategy.
- ruby-fann - Ruby bindings to the Fast Artificial Neural Network Library (FANN).
Data Visualization
Please refer to the Data Visualization section on the Data Science with Ruby list.
Optical Character Recognition
- tesseract-ocr - FFI based wrapper over the Tesseract OCR Engine.
Text Extraction
- yomu - Library for extracting text and metadata from files and documents using the Apache Tika content analysis toolkit.
Full Text Search, Information Retrieval, Indexing
- rsolr - Ruby and Rails client library for Apache Solr.
- sunspot - Rails centric client for Apache Solr.
- thinking-sphinx - Active Record plugin for using Sphinx in (not only) Rails based projects.
- elasticsearch - Ruby client and API for Elasticsearch.
- elasticsearch-rails - Ruby and Rails integrations for Elasticsearch.
- google-api-client - Ruby API library for Google services.
Language Aware String Manipulation
Libraries for language aware string manipulation, i.e. search, pattern matching, case conversion, transcoding, regular expressions which need information about the underlying language.
- fuzzy_match - Fuzzy string comparison with Distance measures and Regular Expression.
- fuzzy-string-match - Fuzzy string matching library for Ruby.
- active_support - RoR ActiveSupport gem has various string extensions that can handle case.
- fuzzy_tools - Toolset for fuzzy searches in Ruby tuned for accuracy.
- u - U extends Ruby’s Unicode support.
- unicode - Unicode normalization library.
- CommonRegexRuby - Find a lot of kinds of common information in a string.
- regexp-examples - Generate strings that match a given regular expression.
- verbal_expressions - Make difficult regular expressions easy.
Articles, Posts, Talks, and Presentations
- 2017
- Scientific Computing on JRuby by Prasun Anand [slides | video | slides | slides]
- Unicode Normalization in Ruby by Starr Horne [post]
- 2016
- Quickly Create a Telegram Bot in Ruby by Ardian Haxha [tutorial]
- Deep Learning: An Introduction for Ruby Developers by Geoffrey Litt [slides]
- How I made a pure-Ruby word2vec program more than 3x faster by Kei Sawada [slides]
- Dōmo arigatō, Mr. Roboto: Machine Learning with Ruby by Eric Weinstein [slides | video]
- 2015
- N-gram Analysis for Fun and Profit by Jesus Castello [tutorial]
- Machine Learning made simple with Ruby by Lorenzo Masini [tutorial]
- Using Ruby Machine Learning to Find Paris Hilton Quotes by Rick Carlino [tutorial]
- Exploring Natural Language Processing in Ruby by Kevin Dias [slides]
- Machine Learning made simple with Ruby by Lorenzo Masini [post]
- Practical Data Science in Ruby by Bobby Grayson [slides]
- 2014
- Natural Language Parsing with Ruby by Glauco Custódio [tutorial]
- Demystifying Data Science: Analyzing Conference Talks with Rails and Ngrams by Todd Schneider [video | code]
- Natural Language Processing with Ruby by Konstantin Tennhard [video | video | video | slides]
- 2013
- How to parse 'go' - Natural Language Processing in Ruby by Tom Cartwright [slides | video]
- Natural Language Processing in Ruby by Brandon Black [slides | video]
- Natural Language Processing with Ruby: n-grams by Nathan Kleyn [tutorial | code]
- Seeking Lovecraft, Part 1: An introduction to NLP and the Treat Gem by Robert Qualls [tutorial]
- 2012
- Machine Learning with Ruby, Part One by Vasily Vasinov [tutorial]
- 2011
- Ruby one-liners by Benoit Hamelin [post]
- Clustering in Ruby by Colin Drake [post]
- 2010
- bayes_motel – Bayesian classification for Ruby by Mike Perham [post]
- 2009
- Porting the UEA-Lite Stemmer to Ruby by Jason Adams [post]
- NLP Resources for Ruby by Jason Adams [post]
- 2008
- Support Vector Machines (SVM) in Ruby by Ilya Grigorik [post]
- Practical text classification with Ruby by Gleicon Moraes [post | code]
- 2007
- Decision Tree Learning in Ruby by Ilya Grigorik [post]
- 2006
- Speak My Language: Natural Language Processing With Ruby by Michael Granger [slides | write-up | write-up]
Projects and Code Examples
- Going the Distance - Implementations of various distance algorithms with example calculations.
- Named entity recognition with Stanford NER and Ruby - NER Examples in Ruby and Java with some explanations.
- Words Counted - Examples of customizable word statistics powered by words_counted.
Books
- Miller, Rob. Text Processing with Ruby: Extract Value from the Data That Surrounds You. Pragmatic Programmers, 2015. [link]
- Watson, Mark. Scripting Intelligence: Web 3.0 Information Gathering and Processing. APRESS, 2010. [link]
- Watson, Mark. Practical Semantic Web and Linked Data Applications. Lulu, 2010. [link]
Community
Needs your Help!
All projects in this section are really important for the community but need more attention. If you have spare time and dedication, please spend some hours on the code here.
- ferret - Information Retrieval in C and Ruby.
- summarize - Ruby native wrapper for Open Text Summarizer.
Related Resources
- Awesome Ruby - Among other awesome items a short list of NLP related projects.
- Ruby NLP - State-of-the-art collection of Ruby libraries for NLP.
- Speech and Natural Language Processing - General List of NLP related resources (mostly not for Ruby programmers).
- Scientific Ruby - Linear Algebra, Visualization and Scientific Computing for Ruby.
- iRuby - IRuby kernel for Jupyter (formerly IPython).
- Kiba - Lightweight ETL (Extract, Transform, Load) pipeline.
- Awesome OCR - Multitude of OCR (Optical Character Recognition) resources.
- Awesome TensorFlow - Machine Learning with TensorFlow libraries.
- rb-gsl - Ruby interface to the GNU Scientific Library.
- The Definitive Guide to Ruby's C API - Modern Reference and Tutorial on Embedding and Extending Ruby using C programming language.
Contributing
We are very glad to see you in this section and highly appreciate any help!
But we also care about the quality of this list. If you want to contribute, please:
- agree that your work will be published under the terms of the CC0 license;
- carefully read the Contribution Guidelines.
Some of the open tasks for contributors are listed in the todo file. You may want to start there.
License
Awesome NLP with Ruby by Andrei Beliankou and Contributors.
To the extent possible under law, the person who associated CC0 with Awesome NLP with Ruby has waived all copyright and related or neighboring rights to Awesome NLP with Ruby.
You should have received a copy of the CC0 legalcode along with this work. If not, see https://creativecommons.org/publicdomain/zero/1.0/.