Why NLP in Data Analytics Shouldn’t Be a Black Box

Kaycee Lai
Feb 24, 2020

The chief goal of Augmented Data Management (ADM) is to greatly simplify and speed up the processes surrounding data analytics. Normally we think of ‘abstracting the complexity’ out of a process as a good thing, and rightly so. But is there a point where ‘oversimplification’ becomes detrimental? In virtually all aspects of life, things can be oversimplified, and applications of NLP in commercial AI are no exception. How much ‘abstraction’ is too much? If the simplification keeps you from arriving at the right information, it has gone too far.

What is a ‘black box’?

Wikipedia defines it as “...a device, system or object which can be viewed in terms of its inputs and outputs...without any knowledge of its internal workings.” NLP, like many other forms of AI and ML, is widely cited as a ‘black box’ approach, because the relationships it models between words are often so high-dimensional that they are practically impossible to represent visually or for humans to wrap their brains around. In other words, NLP models can give you the results of their calculations, often with great accuracy, but they aren’t so great at telling you how they arrived there. Often they require a second model simply to help interpret the results of the first, which has serious implications for how much trust you can put in those results.

What does this mean for business?

In the case of more commonly known use cases, like language translation or voice transcription, it’s pretty easy to spot poorly performing NLP models. It might not be easy to fix them, mind you, but at least you know you’re not running blindly in the wrong direction.

But in other use cases poor NLP model performance might not be so easy to spot. Take, for instance, using NLP to generate SQL queries that join tables from multiple databases. Since most companies have data infrastructures that are a bit of a mess, chances are they have multiple tables with overlapping data that could possibly be joined and analyzed. But, inevitably, some tables are better than others (more complete, accurate data, etc.).

In this case, the ‘black box’ is the inability to see how the final SQL query was arrived at, which is what you would need in order to adjust it and ensure that the best tables are joined for the most accurate analysis. Furthermore, the black box can derail compliance efforts. As Daniel Fabbri, a professor of computer science at Vanderbilt, put it, “If you cannot state what the machine learning algorithm is doing, how can you define what your policy is or even defend it to regulators?”

If you Google directions, there’s a reason why you see step-by-step instructions with alternative routes. People need to be able to customize results based on what they know and trust. This means that you can’t give them a pure black box for anything that’s NLP, ML or AI related. You need to retain an appropriate level of human interaction.

Turning the black box into a “clear box”

For commercial AI/ML platforms to provide real value, they need to simplify processes without sacrificing the ability to customize to the needs of a particular business or department. This can be tricky. In the earlier example, where NLP is used to generate SQL code, if it generates a statement that joins less-than-optimal tables, the SQL statement might as well be gibberish.

You can solve this by using NLP to generate the code while also giving the user the ability to quickly and easily 1) modify the statement based on additional knowledge about the data sources, 2) reassemble alternate tables, and 3) auto-generate a new SQL statement based on the amended table assembly. By supplementing artificial intelligence with human intelligence (or vice versa, if you prefer), you can get to the right answers in short order. Pairing NLP or AI-derived results with human oversight and intervention is a formula for success that can carry over into all kinds of applications.
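To make the three steps above a little more concrete, here is a minimal Python sketch of what a ‘clear box’ might look like. It is not how any particular product works. The names `TableChoice`, `JoinPlan`, `explain` and `swap`, along with the table names, are all hypothetical, and the NLP step is assumed to have already produced the initial plan. The idea is simply that the generated SQL comes with a visible record of which tables were picked and why, and that a person can swap in a better table and regenerate the statement.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TableChoice:
    role: str            # logical entity, e.g. "customers" (hypothetical)
    table: str           # physical table currently selected
    alternatives: tuple  # other tables holding overlapping data
    reason: str          # why this table was selected


@dataclass(frozen=True)
class JoinPlan:
    choices: tuple   # first entry is the base table, the rest are joined to it
    join_key: str    # assumes every table shares this key column
    columns: tuple   # columns to select, qualified by role

    def explain(self) -> list:
        """Show which table was picked for each role, why, and what else exists."""
        return [
            f"{c.role}: using {c.table} ({c.reason}); "
            f"alternatives: {', '.join(c.alternatives) or 'none'}"
            for c in self.choices
        ]

    def swap(self, role: str, new_table: str, reason: str = "chosen by analyst"):
        """Return a new plan with one table replaced by a listed alternative."""
        updated, found = [], False
        for c in self.choices:
            if c.role == role and new_table in c.alternatives:
                # Keep the old table around as an alternative so the swap is reversible.
                alts = tuple(a for a in c.alternatives if a != new_table) + (c.table,)
                c = replace(c, table=new_table, alternatives=alts, reason=reason)
                found = True
            updated.append(c)
        if not found:
            raise ValueError(f"{new_table!r} is not a known alternative for {role!r}")
        return replace(self, choices=tuple(updated))

    def to_sql(self) -> str:
        """Regenerate the SQL statement from the current table assembly."""
        base, *rest = self.choices
        lines = [
            f"SELECT {', '.join(self.columns)}",
            f"FROM {base.table} AS {base.role}",
        ]
        for c in rest:
            lines.append(
                f"JOIN {c.table} AS {c.role} "
                f"ON {c.role}.{self.join_key} = {base.role}.{self.join_key}"
            )
        return "\n".join(lines)


# Assumed output of the NLP step: a plan that happened to pick a legacy table.
plan = JoinPlan(
    choices=(
        TableChoice("orders", "sales.orders", (), "only table with order data"),
        TableChoice(
            "customers",
            "crm.customers_legacy",
            ("warehouse.customers_clean",),
            "closest match to the question wording",
        ),
    ),
    join_key="customer_id",
    columns=("orders.order_id", "customers.region"),
)

# The analyst knows the cleaned table is more complete, swaps it in, and regenerates.
revised = plan.swap("customers", "warehouse.customers_clean")
print("\n".join(revised.explain()))
print(revised.to_sql())
```

The plan is immutable, so `swap` returns a new plan and leaves the original suggestion intact. That makes it easy to compare the machine’s first answer with the human-adjusted one, which is the kind of record you would want when someone asks how a query was arrived at.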

When commercial AI operates as a ‘transparent and customizable box’, it allows a human’s domain expertise to be significantly augmented, eliminating time-consuming processes and greatly improving workflows.