Machine vs Human translation – an update

Gregory P. Bufithis, Esq.

Founder/Chairman     The Posse List

 

30 May 2014 – Ah, technology. So hard to keep up. On the medical front alone, the cost of manipulating life’s building blocks is falling at such a rapid clip … faster than Moore’s Law predictions for silicon … that we can now tailor drugs to match a person’s genome, and fuse technology with living matter.

And so it goes in the legal field.

Last year we ran a post about Google’s quest to end the language barrier and the possible effect on non-English e-discovery document reviews. We wrote about how far companies had come toward making automated language translation easier, faster and more reliable, even though a world of seamless and immediate translation was still out of our grasp.

But … things are getting better and better. This past week Microsoft announced that nearly instantaneous speech translation will be available on Skype by the end of the year, part of its plan to “cross the language boundary” on its popular messaging service. Satya Nadella, Microsoft’s new chief executive, unveiled a test version of the software, which can translate voice calls between people, at the inaugural Code Conference in Rancho Palos Verdes, California. A conversation was held on stage to show a service that was already good enough to translate from English to German with only a short pause, and with 99% accuracy.

It is early days for this technology, and rival services are being developed by Google and NTT DoCoMo, but Skype would offer considerably larger scale. Microsoft said that Skype has more than 300m connected users each month, and more than 2bn minutes of conversation a day. Microsoft has been experimenting with machine translation for more than a decade and has combined it with separate work on speech recognition accuracy. Microsoft’s Cortana voice assistant for Windows Phone already understands voice patterns. Microsoft has been developing a “deep neural net” that can recognize speech and learn from languages. More to come from my CTO later.

Easier, faster and more reliable translations, a world of seamless and immediate translation within our grasp. Yes, still the usual issues: context, syntax, intonation and ambiguity. Because a computer system is not context aware, it could grab the wrong word. Additionally, it doesn’t understand the language at all. It just tries to decode words, instead of decoding the meaning. Many languages are not similar at all: they lack corresponding common words, or they use those words in quite different ways.

But the technology continues to improve. Last month in Paris I attended a Microsoft Research event for its Natural Language Processing group and saw the further developments (over last year) on their Machine Translation (MT) project, which is focused on creating MT systems and technologies that cater to the multitude of translation scenarios today, including legal. The key is Statistical Machine Translation (SMT), which breaks down into areas such as syntax-based SMT and phrase-based SMT. Plus there are Word Alignment and Language Modeling technologies. This week I am in Israel at a Google workshop on advanced language modeling.

These toolkits mean that problems with morphology, syntax, semantics and word sense disambiguation are being worked through. Not solved yet, but coming. For the vendors and the multinational companies who need it, the business model is a no-brainer. An automated, instant, seamless translation platform is valuable enough to a corporation that the vendor that solves it could charge a substantial amount of money for such a tool.

And credit where credit is due. IBM really started it all decades ago when it set up the first machine translation architectures based on mathematical models called translation models. Generally speaking, a translation model accounts for all of the elementary operations that rule the process of translation between the different word orderings of the source and target languages. Translation models are usually enriched with statistical parameters, to help drive the search in the space of all valid transformations of the source sentence into the target sentence. IBM developed specialized algorithms to provide for the automatic estimation of these parameters.
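To make the idea of "automatic estimation of these parameters" a little more concrete, here is a minimal sketch in the spirit of the simplest of the IBM models (usually called IBM Model 1). It is an illustration only, not IBM's actual code: the three-sentence German/English corpus, the function names and the iteration count are all assumptions made for this example, and the NULL word used in the full model is left out. The algorithm starts with every word pairing equally likely, then repeatedly re-estimates the word translation probabilities from the sentence pairs (expectation-maximization).

```python
from collections import defaultdict

# Toy parallel corpus (hypothetical data): (source sentence, target sentence).
CORPUS = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]


def train_ibm_model1(corpus, iterations=10):
    """Estimate t(source_word | target_word) with EM, in the style of IBM Model 1."""
    source_vocab = {s for src, _ in corpus for s in src}
    uniform = 1.0 / len(source_vocab)
    # Start with every co-occurring word pair equally likely.
    t = {(s, w): uniform for src, tgt in corpus for s in src for w in tgt}

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (source, target) pairings
        total = defaultdict(float)  # expected count of each target word
        # E-step: share each source word among the target words in its sentence.
        for src, tgt in corpus:
            for s in src:
                norm = sum(t[(s, w)] for w in tgt)
                for w in tgt:
                    share = t[(s, w)] / norm
                    count[(s, w)] += share
                    total[w] += share
        # M-step: turn expected counts back into probabilities.
        t = {(s, w): c / total[w] for (s, w), c in count.items()}
    return t


def best_source_word(t, target_word):
    """Return the source word with the highest probability for target_word."""
    candidates = {s: p for (s, w), p in t.items() if w == target_word}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    table = train_ibm_model1(CORPUS)
    for word in ("the", "house", "book", "a"):
        print(word, "->", best_source_word(table, word))
```

Nobody tells the program that "Haus" means "house". It works that out because "das" is better explained by "the", which appears in two sentences, leaving "Haus" to pair with "house". Real systems do the same thing over millions of sentence pairs.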

The Google technology is, at its simplest, a technique that automatically generates dictionaries and phrase tables that convert one language into another. The new technique does not rely on versions of the same document in different languages. Instead, it uses data mining techniques to model the structure of a single language and then compares this to the structure of another language. This method makes few assumptions about the languages, so it can be used to extend and refine dictionaries and translation tables for any language pair.
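One way such a "compare the structures" approach can work is to represent each word as a point in space, learn a simple linear map from one language's space to the other's using a handful of known word pairs, and then use that map to guess translations for words outside the seed list. Whether this matches what Google actually runs is an assumption on my part. The sketch below uses hand-made two-dimensional vectors and a three-word seed dictionary, all hypothetical, purely to show the mechanics.

```python
import math

# Hypothetical 2-D "word vectors" for each language, invented for this sketch.
EN = {"one": (1.0, 0.0), "two": (0.8, 0.6), "three": (0.0, 1.0), "four": (-0.6, 0.8)}
ES = {"uno": (0.0, 1.0), "dos": (-0.6, 0.8), "tres": (-1.0, 0.0), "cuatro": (-0.8, -0.6)}

# Small seed dictionary of known pairs (also an assumption of this sketch).
SEED = [("one", "uno"), ("two", "dos"), ("three", "tres")]


def fit_linear_map(pairs):
    """Least-squares 2x2 matrix W so that W applied to x is close to z for each (x, z)."""
    a11 = a12 = a22 = 0.0          # entries of A = sum of x x^T (symmetric)
    b11 = b12 = b21 = b22 = 0.0    # entries of B = sum of z x^T
    for (x1, x2), (z1, z2) in pairs:
        a11 += x1 * x1
        a12 += x1 * x2
        a22 += x2 * x2
        b11 += z1 * x1
        b12 += z1 * x2
        b21 += z2 * x1
        b22 += z2 * x2
    det = a11 * a22 - a12 * a12
    i11, i12, i22 = a22 / det, -a12 / det, a11 / det  # inverse of A
    # W = B * A^-1
    return ((b11 * i11 + b12 * i12, b11 * i12 + b12 * i22),
            (b21 * i11 + b22 * i12, b21 * i12 + b22 * i22))


def apply_map(w, x):
    """Multiply the 2x2 matrix w by the 2-D vector x."""
    return (w[0][0] * x[0] + w[0][1] * x[1], w[1][0] * x[0] + w[1][1] * x[1])


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(u[0], u[1]) * math.hypot(v[0], v[1]))


def translate(word, w):
    """Map an English vector into the Spanish space and pick the nearest word."""
    mapped = apply_map(w, EN[word])
    return max(ES, key=lambda candidate: cosine(mapped, ES[candidate]))


W = fit_linear_map([(EN[e], ES[s]) for e, s in SEED])

if __name__ == "__main__":
    print("four ->", translate("four", W))
```

In this toy the Spanish vectors are simply the English ones rotated, so the map learned from three seed words carries "four" onto "cuatro" even though that pair was never supplied. With real languages the fit is far messier, but that is the trick: the shape of one language's vocabulary is used to predict the shape of another's.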

Ah, the sheer beauty of algorithms. We have written about this numerous times before: quantitative prediction and how it continues to shake up numerous professional services industries by automating or semi-automating tasks previously performed by experts. How it has already changed the legal services industry is now well-trodden ground.

Humans are not exactly known for their predictive skills if one believes Daniel Kahneman’s argument in Thinking, Fast and Slow or Nassim Nicholas Taleb’s assertions in his book Antifragile: Things That Gain from Disorder. The arrival of computer assisted review/technology assisted review/predictive coding in document review has shown that advanced computer analytics can produce more accurate results than reviews using only keyword search and human review. Oh, some debate still lingers on when-to-use-and-when-not-to-use it, and a few quibbles on the numbers, but it is slowly being embraced for several uses. And the economic downturn in the legal industry, and the associated cost control pressures from corporate clients, have further increased the speed at which quantitative prediction solutions have been adopted.

It has not yet decimated the contract attorney industry. Brute force document review still rules the day. Our Posse List contract attorney job posting service does not post every document review job, but a Posse List member is invariably on just about every document review job, and our ever loyal cadre reports in on The Good, The Bad, and The Ugly. Right now in D.C., for example, there is a document review project involving a Fortune 100 company that, 2 years ago, would have required 100+ reviewers. This time around it is being handled with 35. Pretty much the same size data universe as 2 years ago but this time … a predictive coding platform is being used. The technology is faster, better, cheaper. That is what will become the great disruptor in the e-discovery market when it comes to the contract attorney sector. Assuming the power (read: financial vested interests) struggle among law firm, vendor and corporate client ever resolves itself. But the technology is driving the change. Staffing agencies, e-discovery vendors and even corporate in-house legal departments are utilizing “data swat-teams” composed of contract attorneys who possess the tech skills + the analysis ability, with a greater emphasis on data search specialists who have the ability to conduct complex searches, analyze information and generate reports. But fewer bodies are required.

The only contract attorneys deemed “safe” are those who have fluency in one or more non-English languages. Non-English document reviews were up 38% last year according to our sister company, The Posse List, and comprised 72% of all document review jobs they posted. If you follow their job postings you know the hourly pay rates for non-English reviews are significantly higher, downright astronomical for the CJKs … Chinese, Japanese, Korean.

Ah, but times might just be a-changin’. Just as predictive coding has the potential to rend the English language document review market (and is being used more and more in non-English language document reviews), those nasty algorithms are making their way across all languages. As Marc Andreessen said several years ago in his prescient essay Software is eating the world, “all of the technology required to transform industries through software is finally working and can be widely delivered at global scale … don’t be on the wrong side of software-based disruption”.