Bloomberg researchers publish kōan, making CBOW implementations faster and more accurate

Early in 2020, two of Bloomberg’s AI researchers, Adrian Benton and Ozan İrsoy, were conducting research into more efficient contextual word embeddings with Karl Stratos, an assistant professor in the Computer Science Department at Rutgers University (and a former member of the Bloomberg AI Engineering group). During their work, the trio discovered a simple but impactful error in the gradient computations performed by Word2vec, a commonly used natural language processing (NLP) tool for learning word embeddings.
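As described in the technical report, the error concerns the continuous bag-of-words (CBOW) update: CBOW averages the context word vectors, so by the chain rule the gradient passed back to each context vector should carry a factor of 1/C, where C is the number of context words, and the affected implementations omit that factor. The following is a minimal Python sketch of that idea for a single negative-sampling term. It is an illustration only, not code from kōan, Gensim, or word2vec.c; the function name `cbow_context_grads` and the `normalize` flag are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cbow_context_grads(context_vecs, target_vec, label, normalize=True):
    """Gradient of one negative-sampling loss term w.r.t. each context vector.

    Hypothetical helper for illustration only.
    context_vecs: list of C context word vectors (lists of floats)
    target_vec:   output vector of the target word or a negative sample
    label:        1 for the true target word, 0 for a negative sample
    normalize:    True keeps the 1/C factor from the chain rule;
                  False drops it, mimicking the uncorrected update.
    """
    c = len(context_vecs)
    dim = len(target_vec)
    # Hidden vector: the average of the context vectors.
    h = [sum(v[i] for v in context_vecs) / c for i in range(dim)]
    score = sum(h[i] * target_vec[i] for i in range(dim))
    # Loss is -[label*log(s) + (1-label)*log(1-s)] with s = sigmoid(score),
    # so dLoss/dscore = sigmoid(score) - label and dLoss/dh = that * target_vec.
    err = sigmoid(score) - label
    grad_h = [err * t for t in target_vec]
    # Each context vector contributes h with weight 1/C.
    scale = 1.0 / c if normalize else 1.0
    return [[scale * g for g in grad_h] for _ in range(c)]


contexts = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.4]]
target = [0.5, -0.4]
corrected = cbow_context_grads(contexts, target, label=1, normalize=True)
uncorrected = cbow_context_grads(contexts, target, label=1, normalize=False)
```

In this sketch the uncorrected gradients are exactly C times larger than the corrected ones, so the size of the mismatch grows with the context window.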

They recently released open source code for kōan, an alternative to Gensim and word2vec.c for training word embeddings. With this implementation, researchers may find that continuous bag-of-words (CBOW) embeddings perform as well as skip-gram embeddings when used in downstream models. kōan is also faster than both Gensim and word2vec.c in many settings (see the detailed benchmarks in their paper).

Their technical report, “kōan: A Corrected CBOW Implementation,” which details their experiments, has been published on arXiv.

The code for their implementation is now available on GitHub.
