Creating a word embedding in Python (using gensim) from the U.S. Tax Code
Introduction
I’ve run across a need to create a word embedding for legal tax jargon, specifically for the United States Internal Revenue Code (a.k.a. the tax code). The tax code is available in XML format, which is convenient because it lets us extract the individual sections of the tax code. A section is the basic “level” of the tax code document hierarchy (see section 7.
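
As a rough illustration of pulling sections out of the XML, here is a minimal sketch using only Python's standard library. The sample document, the `section` tag name, and the helper names `local_name` and `extract_sections` are all assumptions made for illustration; the real tax code XML may use different tags and namespaces.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the tax code XML. The real file's tag names
# and namespaces may differ; this only mimics the general shape.
SAMPLE_XML = """
<code>
  <section identifier="s1">
    <heading>Tax imposed</heading>
    <content>There is hereby imposed on the taxable income of every individual a tax.</content>
  </section>
  <section identifier="s2">
    <heading>Definitions</heading>
    <content>The term surviving spouse means a taxpayer whose spouse died.</content>
  </section>
</code>
"""


def local_name(tag):
    """Drop an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def extract_sections(xml_text, section_tag="section"):
    """Return one list of lowercase tokens per section element.

    `section_tag` is an assumed tag name, not taken from the real schema.
    """
    root = ET.fromstring(xml_text)
    sections = []
    for elem in root.iter():
        if local_name(elem.tag) == section_tag:
            # Join all text nested inside the section, then tokenize crudely.
            text = " ".join(elem.itertext())
            sections.append(re.findall(r"[a-z0-9]+", text.lower()))
    return sections


sections = extract_sections(SAMPLE_XML)
print(len(sections), sections[0][:3])
```

Each section becomes a list of tokens, which is the shape gensim's `Word2Vec` accepts for its `sentences` argument. The tokenizer here is deliberately crude; legal text would likely need more careful handling of citations and numbering.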