From the Georgetown Law Advanced E-Discovery Institute: Advanced Search and Retrieval Technology

Georgetown Law CLE 2

15 November 2009

The presentation on Advanced Search and Retrieval Technology was made by Jason R. Baron, Maura Grossman and Ralph Losey, all powerhouses in the e-discovery world.

Baron and Losey started off with their multimedia PowerPoint presentation (to the tune of Darude's "Sandstorm," which we had just seen at the Capital One Future of Search conference), and it blew away the crowd – and us, too, again.  In a nutshell, e-discovery is expanding exponentially, and Ralph Losey talked petabytes and exabytes, not terabytes. This was the "beta version" of a presentation that Losey and Baron will give at LegalTech in New York City this coming February.

As an introduction (not necessary for this audience, but a great set-up nonetheless), Jason said there are technologies available to help the litigator reduce the costs of reviewing and producing ESI while at the same time accomplishing the objective of responding to a request for production.  Most commonly used by litigators today are review tools that enable reviewers to work through the ESI in an online repository.  Vendors that provide these review tools also typically offer filtering and processing services, where they take the ESI that has been collected and, behind the scenes, apply filters to narrow the volume to the ESI that is likely to be relevant to the request for production.

A popular filter is the application of keywords, developed by the litigator, to the collected ESI. After applying the keywords, the vendor provides a “frequency report” or “hit list” of the number or percentage of documents that hit on a particular keyword so that the litigator can evaluate the efficacy of the selected keywords.  
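
To picture what such a hit list contains, here is a minimal sketch in Python (not any particular vendor's product) that counts, for a handful of made-up documents and keywords, how many documents hit on each term and what percentage of the collection that represents; real tools do the same thing against a proper index rather than raw text:

```python
from collections import Counter

def keyword_hit_report(documents, keywords):
    """Count how many documents contain each keyword (case-insensitive substring match)."""
    hits = Counter()
    for text in documents:
        lowered = text.lower()
        for term in keywords:
            if term.lower() in lowered:
                hits[term] += 1
    total = len(documents)
    for term in keywords:
        count = hits[term]
        print(f"{term}: {count} documents ({100 * count / total:.1f}%)")

# Made-up documents and keywords, purely for illustration.
docs = [
    "Please review the licensing agreement before the merger closes.",
    "The merger timeline slipped again; see the attached schedule.",
    "Cake in the kitchen for Dana's birthday!",
]
keyword_hit_report(docs, ["merger", "licensing", "schedule"])
```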

There may be various iterations of this process until the litigator approves the results in the frequency report.  The vendor then processes the filtered ESI and uploads it to a web-based review tool for the review to begin.

There is also new automated "early case assessment" technology that has entered the marketplace, which review tool vendors are rushing to add to their current products. This technology allows for a thorough front-end look at the full volume of ESI collected in response to the request for production, instead of just the ESI that is filtered, processed and uploaded to the review tool. Thus, by using this new technology, the litigator can find the "significant documents" very early in the case instead of waiting until the end of the review process, after the reviewers have reviewed and "tagged" the significant documents.

Moreover, this technology enables the administrator and/or the litigator to perform keyword searching and other filtering on their own, without incurring additional charges and without having to rely on the vendor for these services. This technology also provides automated analytics so that the litigator can obtain a high-level understanding of the ESI, which can identify key players, lines of communication between custodians, and types of significant documents. This knowledge will help shape the review and the litigator's investigation of the facts of the case.
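
As a rough illustration of the "key players and lines of communication" analytics, here is a minimal sketch, assuming email metadata has already been extracted into hypothetical (sender, recipient) pairs; it simply tallies who sends the most and which pairs of custodians exchange the most messages, which is the kind of analysis these tools automate across millions of records:

```python
from collections import Counter

# Hypothetical, already-extracted email metadata: (sender, recipient) pairs.
messages = [
    ("alice@example.com", "bob@example.com"),
    ("alice@example.com", "carol@example.com"),
    ("bob@example.com", "alice@example.com"),
    ("alice@example.com", "bob@example.com"),
    ("carol@example.com", "bob@example.com"),
]

# Key players: who sends the most messages.
senders = Counter(sender for sender, _ in messages)

# Lines of communication: which pairs of custodians exchange the most
# messages, regardless of direction.
channels = Counter(tuple(sorted(pair)) for pair in messages)

print("Most active senders:", senders.most_common(3))
print("Busiest lines of communication:", channels.most_common(3))
```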

Maura Grossman then followed with what we thought was a brilliant presentation on the challenges of search.  Our review cannot do it justice (we have links below to background material provided by Maura and Jason) so just some high points from her presentation:

1.  There is no way to review everything manually, in large matters, in the time frames dictated by the typical litigation or investigation.

2.  Manual review does not scale well, and the cost of responsiveness and privilege review can quickly dwarf the costs of all of the other stages of the e-discovery process.

3.  Lawyers are not nearly as talented at search as they think they are.  The Blair and Maron study (1985) was the first to demonstrate the significant gap between lawyers' perceptions of their ability to ferret out relevant documents and their actual ability to do so.  In a 40,000-document case, consisting of 350,000 pages, the lawyers estimated that their searches had identified 75% of the relevant documents, when, in fact, they had identified only about 20% of them.

4.  The use of keywords, alone, is unlikely to reliably produce all relevant documents from a large, heterogeneous document collection, for a whole host of reasons, including:

     a.  Information retrieval is already a very difficult problem when it involves plain-vanilla, English-language text documents. The problem is magnified when you address a multi-lingual set of documents, or nontextual forms of ESI such as photographs or audio and video files, which are typically not searchable as text.

      b.  The inherent ambiguity of language, in particular:

            Synonymy = there can be considerable variation in describing the same person or thing, e.g., diplomat, ambassador, consul, official.

           Polysemy = the same term can have multiple meanings, e.g., "Bush" (referring to two presidents; a shrub; a place in Africa; a thick furry tail; "bush league," among other slang usages) or "strike" (referring to a labor activity; the act of hitting; the baseball kind; finding oil or gold and "striking it rich;" and so on).

        c.  The ubiquity of human error, e.g., misspellings and typos: there were 250 different spellings of the word "tobacco" in the MSA database, and a search for "management" will miss "managment" and "mangement" (see the sketch after this list).

        d.  Abbreviations (e.g., "P&C/ACC"); colloquialisms (e.g., Haynes & Boone / H&B / HayBoo); slang; code words; and new short-forms used in text messaging and IM (e.g., "FWIW", "LMAO").

       e.   The problem is compounded by errors introduced by optical character recognition ("OCR") of scanned documents.

      f.  Poor records management, including lack of organization and/or proper labeling, the reflexive use of “Reply” even when the subject matter of an email has changed, and so on.

      g.  Deadlines and resource constraints that place practical limits on what can be achieved.

       h.  And finally, there is a widespread failure to employ “best practices” in the area of search and retrieval. Lawyers believe that because they know how to use Westlaw, Lexis, and Google, they know how to do search, but finding a few good examples of something is a very different task than finding as close to all of that thing as possible, without also including a lot of junk.
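
On the misspelling point in item c above, a small sketch shows why an exact keyword match misses variant spellings and how a fuzzy comparison (here, Python's standard-library difflib) catches some of them; this is only an illustration of the problem, not a recommendation of any particular tool:

```python
import difflib

# Terms as they actually appear in a document collection.
terms_in_documents = ["management", "managment", "mangement", "marketing"]

# An exact keyword search finds only the correctly spelled term.
exact_hits = [t for t in terms_in_documents if t == "management"]

# A fuzzy comparison also catches close misspellings.
fuzzy_hits = difflib.get_close_matches("management", terms_in_documents,
                                       n=5, cutoff=0.8)

print("Exact match:", exact_hits)   # ['management']
print("Fuzzy match:", fuzzy_hits)   # also picks up 'managment' and 'mangement'
```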

So, what are the “best practices” for keyword searching?

1.  You start with the complaint, the subpoena, or the request for production. First, you determine: Who are the relevant custodians? What is the applicable time frame? What terms of art are employed?

2.  Then, you translate what the request is seeking into plain, everyday English to get as close as possible to the terms that people are most likely to use in their daily communications.

3. Try to have a couple of different people do this to ensure that you are getting the benefit of multiple interpretations of the requests and potential keywords from different vantage points.

4.  This is the basic starting point for your search-term list.

5.  Next—and this is the step that is most often overlooked by lawyers—you must seek input from the people who actually created, sent, or received the documents.  These are your best subject-matter experts.

6.  Ask them questions like:  “Who would be most likely to have created, sent, or received emails or documents on these subjects?”  “What distribution lists would have been used?”  “What time frame would these emails or documents cover?”   “What events would these emails or documents discuss?”   “What names, words, or terms would be likely to appear in these emails or documents?”  “What abbreviations, acronyms, slang, or code words might have been used?”   “If you were looking for emails or documents responsive to these requests,  how would you go about finding them?”  “What kinds of attachments would these emails have?”

7. If warranted by the stakes of your matter, consider whether an hour or two of a linguist’s or substantive expert’s time would help you to significantly improve the quality of your search term list.

8. Next, look at a bunch of documents that you already know to be responsive (for example, some that you obtain from a key custodian).  Ask yourself, what unique words or phrases distinguish these documents? In what context do the documents appear? (If you are using a search tool that employs machine learning, these documents can be the start of your “seed” or training set.)

9. If possible, have your vendor index the documents in the set and provide you with a list of the words that appear in the documents, ranked from most to least frequently appearing. Use that list to identify documents that are likely to be unresponsive (“birthday,” “baby shower”) or privileged, and to identify search terms you may have missed.
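
The ranked word list in item 9 is easy to picture. Here is a minimal sketch, assuming the documents have already been extracted to plain text; a vendor's index does the same thing at scale, with stemming, stop-word handling, and the rest:

```python
import re
from collections import Counter

# Assume the collection has already been extracted to plain text.
documents = [
    "Reminder: the board meeting on the merger is moved to Friday.",
    "Draft merger agreement attached; please send comments before the meeting.",
    "Cake in the kitchen for Dana's birthday!",
]

# Tokenize to lowercase words and count occurrences across the whole collection.
counts = Counter(
    word
    for text in documents
    for word in re.findall(r"[a-z']+", text.lower())
)

# Ranked from most to least frequently appearing.
for word, count in counts.most_common(10):
    print(f"{word:<12} {count}")
```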

 

Ok, there was a lot more.  To help, here is a link to Jason and Maura’s slides (click here).

Some  suggested references:

* Craig Ball has a paper on his website summarizing search steps.   It is entitled “Surefire Steps to Splendid Search” (June/July 2009) (Click here).

* The Sedona Conference® Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery (Aug. 2007 Public Comment Version) (click here)

*  The Sedona Conference® Commentary on Achieving Quality in the E-Discovery Process (May 2009 Public Comment Version) (click here)

* The National Institute of Standards and Technology (NIST) Text REtrieval Conference (TREC) 2009 Legal Track (click here)  

 

Take-Away Messages from the panel

1.  Success in search requires a well thought-out process with substantial input at the front-end and some degree of testing, sampling, feedback and/or iteration.

2.  The amount of testing, sampling, feedback and/or iteration should reflect the same proportionality considerations inherent in all discovery, i.e., the amount in controversy, the time and resources available, the importance of the evidence to the determination of the dispute, etc.

3.  Different search approaches are best for different tasks. For example, some things are simply easier to search for than others, e.g., patent or pharmaceutical litigation versus evidence regarding offshore accounts or document destruction/shredding.  Do you need a few good examples, or are you trying to find "all"?

4.  There is no guarantee that any search method will identify all responsive documents in a large, heterogeneous data set, and different search methods can produce different result sets. Hybrid or fusion approaches tend to be more successful, but are also more costly and time-consuming (a brief sketch of one fusion technique appears after these take-aways).

5.  Automated technology can help, but it's not the "end-all-be-all." Due diligence is absolutely necessary in this current "Wild West" marketplace.

6.  At least some degree of transparency and collaboration is necessary. Obviously, an agreed-upon search methodology (or search-term list) is preferable to a unilateral approach that is subject to second-guessing and “do-overs.”  Parties must be able to explain what they have done and why it is reasonable under the circumstances. 

7.  It is important for practitioners to keep up with the case law, research, and literature in this area because it is quickly evolving. There are consultants (including linguists and statisticians) who have expertise in this area and can help devise or mediate a reasonable search protocol if the parties cannot agree on one.
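
On take-away 4, "fusion" simply means combining the result lists produced by different search methods. One common combination rule is reciprocal rank fusion; the panel did not endorse any particular formula, so treat this as an illustrative sketch only:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked result lists: a document ranked highly by
    more than one method accumulates a higher combined score."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical output from two different search methods on the same request.
keyword_results = ["doc7", "doc2", "doc9", "doc4"]
concept_results = ["doc2", "doc5", "doc7", "doc1"]

print(reciprocal_rank_fusion([keyword_results, concept_results]))
# doc2 and doc7, found by both methods, come out ahead of the rest.
```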

A (very) brief note on the Text REtrieval Conference (TREC)

TREC was mentioned several times at the panel (and throughout the conference), especially the opportunity to participate in the 2010 TREC Legal Track.  We will have a detailed post on TREC before the year is out, but here is a short "bio" of TREC from Ellen M. Voorhees of the National Institute of Standards and Technology (NIST), who was scheduled to appear but could not:

Evaluation is a fundamental component of the scientific method: researchers form a hypothesis, construct an experiment that tests the hypothesis, and then assess the extent to which the experimental results support the hypothesis.  A very common type of experiment is a comparative experiment in which the hypothesis asserts that Method 1 is a more effective solution than Method 2, and the experiment compares the performance of the two methods on a common set of problems.

The set of sample problems, together with the evaluation measures used to assess the quality of the methods' output, form a benchmark task.  Information retrieval researchers have used test collections, a form of benchmark task, ever since Cyril Cleverdon and his colleagues created the first test collection for the Cranfield tests in the 1960s. Many experiments followed in the subsequent two decades, and several other test collections were built.

Yet by 1990 there was growing dissatisfaction with the methodology. While some research groups did use the same test collections, there was no concerted effort to work with the same data, to use the same evaluation measures, or to compare results across systems to consolidate findings. The available test collections were so small—the largest of the generally available collections contained about 12,000 documents and fewer than 100 queries—that operators of commercial retrieval systems were unconvinced that the techniques developed using test collections would scale to their much larger document sets. Even some experimenters were questioning whether test collections had outlived their usefulness.

At this time, NIST was asked to build a large test collection for use in evaluating text retrieval technology developed as part of the Defense Advanced Research Projects Agency's TIPSTER project. NIST proposed that instead of simply building a single large test collection, it organize a workshop that would both build a collection and investigate the larger issues surrounding test collection use. This was the genesis of the Text REtrieval Conference (TREC). The first TREC workshop was held in November 1992, and there has been a workshop held annually since then.
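
To make the benchmark idea concrete, here is a minimal sketch with made-up relevance judgments: two hypothetical retrieval methods are scored on the same topic using precision and recall, the kinds of measures reported in TREC-style evaluations:

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved documents that are relevant.
    Recall: share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_hits = len(retrieved & relevant)
    precision = true_hits / len(retrieved) if retrieved else 0.0
    recall = true_hits / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up relevance judgments for a single benchmark topic.
relevant_docs = {"d1", "d2", "d3", "d4", "d5"}

# Hypothetical result sets from two methods being compared.
method_1 = ["d1", "d2", "d6"]
method_2 = ["d1", "d2", "d3", "d4", "d7"]

for name, results in [("Method 1", method_1), ("Method 2", method_2)]:
    p, r = precision_recall(results, relevant_docs)
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```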

We will have a detailed post on TREC before the year is out.
