Creating a word embedding in Python (using gensim) from the U.S. Tax Code

Tuesday, May 12, 2020 | Tags: word embedding, natural language processing, tax

Introduction

I’ve run across a need to create a word embedding for legal tax jargon, specifically centered on the United States Internal Revenue Code (a.k.a. the tax code). The tax code is available in an XML format, which is convenient because it allows us to extract the different sections of the code. A section is the basic “level” of the tax code document hierarchy (see section 7.5 of the U.S. Legislative Markup User’s Guide for more information). More importantly, each section of the tax code is centered on a specific topic. So, at least for starters, it makes sense to train the word embedding by feeding it the sections of the tax code. The XML format makes this easy.

To create the word embedding, we’ll use Radim Rehurek’s gensim package. It has both the Word2Vec and FastText algorithms built in.

from gensim.models.word2vec import Word2Vec
from gensim.models.fasttext import FastText
from multiprocessing import cpu_count
import gensim.downloader as api
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize

Read the XML document

It’s a large file, so it takes a while to load.

xmlFile = open("./data/usc26.xml", encoding="utf8")
xml = BeautifulSoup(xmlFile, features="xml")
xmlFile.close()

Extract the sections and prepare them

You can learn more about the structure of the XML in section 7.5 of the U.S. Legislative Markup User’s Guide.

Preparation of the sections entails the following elements:

  1. Extract all the <section> elements. There are some <section> elements that do not correspond to actual (topical) sections of the tax code. In the pure text, sections almost always start with the “§” symbol followed by a number. In XML, these look like this:

<section ...><num value="1">§ 1.</num><heading> Tax imposed</heading>...

<section> elements that do not correspond to topical sections are missing a number in the “value” attribute of the <num> tag, like this:

<section ...><num value=""/><chapeau>

We use BeautifulSoup’s findAll() function to extract all the <section> elements. Then we use a list comprehension to keep only those whose <num> element has a non-empty “value” attribute. We end up with 2151 sections.

  2. Extract the pure text from the <section> elements. In other words, strip the XML tags out. We use BeautifulSoup’s getText() function for that. We end up with a list of 2151 strings.

# extract and filter <section> elements
sections = xml.findAll("section")
sections = [section for section in sections if section.num["value"]]
section_list = [section.getText(separator=" ") for section in sections]
len(section_list)
## 2151

Here’s a peek at the contents of one of those strings. You can compare it with a web-based version of the tax code here.

section_list[0][0:250]
## '§\u202f1.  Tax imposed (a)  Married individuals filing joint returns and surviving spouses There is hereby imposed on the taxable income of— (1)  every married individual (as defined in section 7703) who makes a single return jointly with his spouse under'

  3. Tokenize the section text. We use the word_tokenize() function from nltk.tokenize. But we have to preserve each section as a separate entry in the list in order to train the word embedding on whole sections. So we define a helper function, which also takes care of a few other details, like forcing words to lowercase and removing non-alphanumeric tokens.

def tokenizeSections(section_list):
  # lowercase every token and keep only alphanumeric tokens, one token list per section
  words = []
  for section in section_list:
    words.append([word.lower() for word in word_tokenize(section) if word.isalnum()])
  return words
  
words = tokenizeSections(section_list)

Create the word embedding

gensim’s interface is pretty much the same for the Word2Vec and FastText algorithms. We’ll create both here. The FastText algorithm takes longer. gensim doesn’t currently support GPU, but it seems to train fast enough on CPUs.

Word2Vec

w2v_model = Word2Vec(words, min_count = 4, workers=cpu_count())

FastText

ft_model = FastText(words, min_count = 4, workers=cpu_count(), iter=30)

Test the word embeddings

Word embeddings allow computers to understand how similar two words are. In a tax context, we would expect a “deduction” (something that lowers your taxes) to be related to an allowance or credit or other things that reduce your tax liability.

w2v_model.wv.most_similar("deduction")
## [('deductions', 0.7352703809738159), ('allowance', 0.6905843019485474), ('allowable', 0.6510364413261414), ('credits', 0.6339372992515564), ('credit', 0.6309539079666138), ('depreciation', 0.6180007457733154), ('allowed', 0.6096256971359253), ('carryover', 0.5693126320838928), ('depletion', 0.5520280003547668), ('151', 0.5477799773216248)]
ft_model.wv.most_similar("deduction")
## [('deductions', 0.8794976472854614), ('deducts', 0.7809832692146301), ('reduction', 0.7809756994247437), ('deductibility', 0.77668696641922), ('deduct', 0.7533663511276245), ('deducting', 0.7462096810340881), ('deductibles', 0.7056588530540466), ('allowance', 0.7005971670150757), ('induction', 0.6958378553390503), ('auction', 0.6953754425048828)]

You can clearly see that the FastText algorithm incorporates subword information (character n-grams), making words that derive from the same root more similar.
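
Because FastText represents a word as a bag of character n-grams, it can even assemble a vector for a token it never saw during training, which Word2Vec cannot. A quick, hypothetical illustration (the misspelled query below is not from the corpus):

# FastText builds a vector for the misspelled token from its character n-grams,
# so this query still returns neighbors (likely including "deduction").
ft_model.wv.most_similar("deducton")

# Word2Vec has no subword information, so an out-of-vocabulary token fails:
# w2v_model.wv.most_similar("deducton")  # raises KeyError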

Conclusion

This is a very basic word embedding. There are a number of things that could be done to improve it. For example, extracting and preserving n-grams: there are a number of multi-word phrases in tax that carry significant meaning (e.g., “adjusted gross income” and “child tax credit”). Removing stop words (e.g., “a”, “the”, and “an”) might also help.
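
As a rough sketch of both ideas (untested against this corpus; the phrase-detection thresholds and variable names below are placeholders), gensim’s Phrases model can merge frequent word pairs such as “gross income” into single tokens, and NLTK’s stop word list can strip common function words before retraining:

from gensim.models.phrases import Phrases, Phraser
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

# learn frequent bigrams from the tokenized sections, e.g. "gross income" -> "gross_income"
bigram = Phraser(Phrases(words, min_count=5, threshold=10))
words_phrased = [bigram[section] for section in words]

# drop common English stop words from each section
stops = set(stopwords.words("english"))
words_filtered = [[w for w in section if w not in stops] for section in words_phrased]

# retrain on the cleaned-up sections
w2v_model2 = Word2Vec(words_filtered, min_count=4, workers=cpu_count())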

More conceptually, it seems reasonable that section references inside the code should be semantically important. For example, we all know what a 401K is, but few know that 401(k) is a section of the tax code that describes the retirement fund details.
