Understand the user's intent and produce answers, like Apple's Siri. Except in more languages, with more flexibility, in more challenging scenarios, and with lightning-fast development.
Extract named entities, like ThingFinder or OpenCalais. Except in more languages and with more options.
Translate foreign content, like Google Translate. Except more customizable, and without using English as a pivot.
Determine the sentiment of a social media post, with a thorough breakdown of what contributed to that sentiment.
Extract actionable intelligence from customers' feedback, like Clarabridge. Except in more languages and more customizable.
Let the user search content in foreign languages, like... OK, no one else does it yet.
Reasoning machine. Linguistic filter. Virtual machine for natural language processing. A backbone for Web 3.0 applications.
One incredible piece of software that does it all, called the Carabao Linguistic Virtual Machine, encapsulating much of the human linguistic ability.
Reusable components are a cornerstone of engineering (not only software engineering). Every system that embeds a component is an opportunity to polish and test it: the more a component is tested, the more reliable it becomes, and the lower the chance of a glitch. A new system embedding the same component inherits the sum of its experience with all the other systems.
This is hardly a secret. DRY (don't repeat yourself) is an almost universally accepted principle of software engineering. That's right, almost. Sadly, it is not at all universally accepted in the world of linguistic software. While it can be argued that different linguistic tasks involve different logic, much of the core logic is the same. And, most importantly, once the gargantuan task of understanding the utterance is complete, all these tasks become nearly trivial. After all, language is a symbolic representation of the real world, and the bulk of the difficulty lies in making sense of the infinite set of words.
There are several reasons why such components are so rare. It is difficult to justify such a massive undertaking with no clear product in sight (let alone the practical difficulty of explaining the purpose to non-technical decision makers). Additionally, the natural language processing software industry is driven by an academic mindset, putting more emphasis on the task at hand than on the general architecture. Finally, building a universal multilingual engine requires a multilingual perspective and an enormous effort.
Did the effort of building Carabao, spanning several years, pay off? It certainly did. See the list of applications above, which is far from complete and keeps growing. More applications are being created, with exotic ones, like converting text to pictures or attaching the engine to brain-computer interfaces, in sight as well. Moreover, the intermediate products, like the morphological models, can be utilized in applications such as search engines.
The principle of Carabao is fairly straightforward. The caller application, whether it is a natural language user interface, a sentiment analysis tool, or a machine translation application, feeds text content into the LVM, which, based on the source-language model from an underlying linguistic database, converts it into language-neutral codes (more specifically, semantic references, grammatical information, and style information). In the case of a transformative application, like MT or paraphrasing, these language-neutral codes are rearranged according to the target-language model. The caller application then makes use of either the language-neutral codes or the transformed content, depending on what it is supposed to accomplish.
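As a rough sketch, the flow described above might look like this in code. The lexicons, family IDs, and function names here are invented for illustration; this is not the actual Carabao API, only the general analyze-then-realize shape of the pipeline.

```python
# Toy illustration of the flow described above (NOT the real Carabao API):
# text is analyzed into language-neutral codes via a source-language lexicon,
# then realized in another language. Words and family IDs are invented.

# Both toy lexicons share the same interlingual family IDs.
EN_LEXICON = {"water": 101, "buffalo": 102}
DE_LEXICON = {"Wasser": 101, "Büffel": 102}

def analyze(text, lexicon):
    """Map each token to its language-neutral code (family ID)."""
    return [lexicon[token] for token in text.split()]

def realize(codes, lexicon):
    """Render neutral codes as words of the target language."""
    inverse = {family_id: word for word, family_id in lexicon.items()}
    return " ".join(inverse[code] for code in codes)

codes = analyze("water buffalo", EN_LEXICON)
print(codes)                       # [101, 102]
print(realize(codes, DE_LEXICON))  # Wasser Büffel
```

A non-transformative caller (sentiment analysis, a natural language interface) would stop at the neutral codes; only transformative applications need the second step.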
For instance, in the case of a natural language user interface application like NLUI Server, the codes are examined by the script-running engine and matched against the conditions associated with the script's application. Since we get codes for the actual entities behind the words, the same script works for all languages in the database. Moreover, we may take advantage of the semantic network and capture whole categories. For example, with a condition like hypernym=8981 we can capture all kinds of "Asian", e.g. "Chinese", "Indian", "Japanese", "Korean", as well as finer divisions like "Cantonese", "Sichuan", etc.
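A condition like hypernym=8981 can be sketched as a walk up a toy semantic network. The ID 8981 for the "Asian" category comes from the text above; every other ID, and the network itself, is invented for illustration.

```python
# Toy hypernym check, sketching the condition "hypernym=8981" from the text.
# 8981 ("Asian") is taken from the article; all other IDs are invented.

HYPERNYMS = {
    9001: 8981,  # Chinese   -> Asian
    9002: 8981,  # Indian    -> Asian
    9003: 9001,  # Cantonese -> Chinese
    9004: 9001,  # Sichuan   -> Chinese
}

def has_hypernym(family_id, target):
    """Walk up the hypernym chain; True if `target` is an ancestor concept."""
    while family_id in HYPERNYMS:
        family_id = HYPERNYMS[family_id]
        if family_id == target:
            return True
    return False

print(has_hypernym(9003, 8981))  # True: Cantonese -> Chinese -> Asian
print(has_hypernym(9002, 9001))  # False: Indian is not under Chinese
```

Because the walk is transitive, one condition covers both the direct hyponyms ("Chinese", "Indian") and the finer divisions ("Cantonese", "Sichuan").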
Much of today's linguistics is dominated by Eurocentric concepts, which are not well adapted to languages built with a completely different mindset. For instance, the same word in Chinese may function as a noun, a verb, or even a conjunction (translated into Indo-European concepts, that is); semantics plays a far bigger role there. This makes the concept of "part of speech" somewhat artificial in this scenario. (Of course, it is possible to build language models using any theoretical construct, but that is like trying to put on ill-fitting clothes.) This is just one of numerous examples.
When adding new languages en masse, these differences become obvious very quickly. Therefore, when equipped with rigid one-size-fits-all linguistic concepts, developers either have to "live with it" or build special modules to bypass the mainstream logic (which, obviously, makes development neither cheaper nor less error-prone).
Therefore, there are no linguistic concepts defined in the Carabao executables. All of them sit in the linguistic database, bundled together with the rest of the linguistic logic. The concept of metagrammar-based architecture has been described by Dr. Emily M. Bender of the University of Washington. Strictly speaking, Carabao can be used to model any language, be it English, Chinese, German, or dolphins' clicks and whistles.
To make the processing uniform, the first stage (shallow preprocessing and morphological analysis) essentially converts the text string into a list of internal structures, based on the internal dictionary. Different modules take care of the various segmentation modes: tokenization for languages that do not use white space (like Chinese, Japanese, and Thai), decompounding for compounding languages like German, Dutch, and Finnish, desegmentation for languages like Vietnamese and Malay, which insert spaces between parts of some words, and clitic extraction for many European and Semitic languages. As these operations are not trivial, these components are exposed via the Carabao API, providing yet another use for the linguistic virtual machine.
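To illustrate the tokenization mode for languages without white space, here is a simplified greedy longest-match tokenizer against a dictionary. Real segmentation is considerably subtler, and the tiny vocabulary below is invented; this is a stand-in for the idea, not Carabao's algorithm.

```python
# Simplified greedy longest-match tokenizer as a stand-in for the
# dictionary-based tokenization step described above. Real CJK segmentation
# is far more subtle; the vocabulary here is invented for illustration.

def greedy_tokenize(text, vocab):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest dictionary entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary entry matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens

print(greedy_tokenize("水牛喜欢水", {"水牛", "喜欢", "水"}))
# ['水牛', '喜欢', '水']
```

Decompounding and desegmentation are, in effect, the same dictionary-matching problem run in different directions, which is why a shared component pays off.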
But the most important aspect of the linguistic abstraction is the actual "linguistic virtual machine" architecture. To run all the available languages on a new platform, the language models, with all their linguistic instructions and concepts, don't have to be ported; only the runtime, the Carabao LVM itself, does. This is just like any other virtual-machine-based system, e.g. Java or .NET.
Semantic Network and Interlingua Plus
The principle of interlingua, assigning the same codes to equivalent terms in different languages, historically had implementation issues, mostly due to the difficulty of building the database. However, recent advances have made it possible to build adequate multilingual databases. Carabao employs a structure that can be termed "interlingua plus": the terms are grouped by concept into so-called families. A family is a set of records, where each record has a lemma, a stem, and some metadata, such as grammar (called "rule units" in Carabao, as it is not always strictly classic grammar) and stylistic information (e.g. regional use, professional use, medium of use, and more). Hence, in addition to the semantic dimension (the family ID number), we have a grammatical dimension and a stylistic dimension. The family structure allows for an easy way to store exceptions (for instance, bought is simply a record with a more complete set of rule units in the same family as buy), and an easier way to build a working interlingua (if there is an obscure term which is "not exactly the same", simply store it in the same family and give it an obscure style tag).
As the family IDs are the same across languages, the multilingual lexicons can be used for crosslingual retrieval.
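The family structure and the crosslingual lookup it enables can be modeled roughly as follows. The family ID (501), the records, and the field names are invented for illustration; only the general shape, lemma plus rule units plus style tags grouped under a shared family ID, follows the description above.

```python
# Toy model of the "interlingua plus" family structure: records carrying a
# lemma, rule units, and style tags, grouped under a shared family ID.
# The family ID (501), records, and field names are invented.

from dataclasses import dataclass

@dataclass
class Record:
    lemma: str
    lang: str
    rule_units: tuple = ()  # grammatical info, e.g. ("verb", "past")
    style: tuple = ()       # stylistic tags, e.g. ("regional:US",)

# Family 501 ~ the concept BUY; "bought" is simply a record with a fuller
# set of rule units in the same family as "buy", as described above.
FAMILIES = {
    501: [
        Record("buy", "en", ("verb", "infinitive")),
        Record("bought", "en", ("verb", "past", "past-participle")),
        Record("kaufen", "de", ("verb", "infinitive")),
    ],
}

def crosslingual_lookup(word, target_lang):
    """Find the word's family, then list that family's terms in target_lang."""
    for records in FAMILIES.values():
        if any(r.lemma == word for r in records):
            return [r.lemma for r in records if r.lang == target_lang]
    return []

print(crosslingual_lookup("bought", "de"))  # ['kaufen']
```

Since the lookup goes through the family ID rather than through any single language, a query in one language retrieves terms in any other, which is what makes crosslingual retrieval fall out of the structure for free.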
Perhaps uniquely to the Carabao LVM, the patterns capturing syntactic structure (so-called "sequences") are also interlingual, and can even be stored in the same families as words, allowing the capture of idiomatic expressions with gaps.
This is just a short introduction to the Carabao Linguistic Virtual Machine. As a novel concept, it is often misunderstood. Carabao is not a natural language user interface, machine translation, or sentiment analysis framework; it is what powers such solutions.
Today, after years of development, Carabao is no longer a proof-of-concept framework but a mature product, used by thousands of people worldwide.