Here at RAVN Systems we’re always looking for new ways to deal with unstructured data. Approaches we’re using on a day-to-day basis include:
- Indexing/searching: RAVN Pipeline connects to multiple repositories that can contain unstructured data (such as filesystems, content management systems and so on). The documents it crawls are typically indexed into RAVN Core, which can be seen as a centralised repository of documents optimised for retrieval, search and linking documents together.
- Clustering: typically exposed in a search interface, this approach reveals to end users which topics are discussed in a set of documents, most often the documents returned in a search result.
- Classifying: deciding which of a set of predefined labels each document belongs to. An example of usage might be automatically placing previously unseen documents into a folder structure.
- Semantic analysis: rather than focussing on a set of documents as a whole, semantic analysis is concerned with uncovering what a single document is about by processing its words, phrases, sentences and paragraphs. Examples of usage might include understanding which clauses of a legal contract are connected to which legal concepts, automatically generating a summary of a document, or extracting concepts of interest from it.
What is Named Entity Recognition?
The most pertinent information in a document is typically revealed in the names that occur within it. For news data, names include those of people, organisations and locations, but for a medical dataset they might include gene and protein names. The types of name we are interested in are specific to the domain of the data we’re working with. Other entities that we are interested in extracting are dates, addresses, monetary values and so on, all of which enrich a document’s metadata and significantly improve search retrieval and the linking of a company’s knowledge base.
Named Entity Recognition (NER) is the task of labelling the sequences of words in a text that are the names of things (see Figure 1).
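To make "labelling sequences of words" concrete, here is a minimal sketch in Python. It is not RAVN's approach: it simply looks names up in a tiny, hypothetical gazetteer (`GAZETTEER`) and prefers the longest match at each position. The entries, type names and the `label_entities` function are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical gazetteer: lower-cased token sequences mapped to an entity type.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("ada", "lovelace"): "PERSON",
    ("acme", "ltd"): "ORGANISATION",
    ("london",): "LOCATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def label_entities(tokens: List[str]) -> List[Tuple[str, str]]:
    """Return (name, type) pairs, preferring the longest match at each position."""
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate sequence first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                entities.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                break
        else:
            # No name starts at this token; move on.
            i += 1
    return entities


tokens = "Ada Lovelace joined Acme Ltd in London".split()
print(label_entities(tokens))
# [('Ada Lovelace', 'PERSON'), ('Acme Ltd', 'ORGANISATION'), ('London', 'LOCATION')]
```

A lookup like this only finds names it already knows about, which is one reason real systems tend to rely on more than a fixed list.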
Why is NER Valuable?
NER is valuable for the following reasons:
- Quickly identifying what a document is about
- Enhancing search retrieval in terms of faceting and result rank weighting
- Linking documents based on the concepts within them
- Highlighting or redacting: identifying all persons in a document (typically in the legal space) and automatically highlighting or blanking out those names
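As a rough illustration of the faceting and redaction uses listed above, the sketch below assumes an earlier NER step has already produced character-offset spans of the form `(start, end, type)`. The span format and the `redact` and `facet_counts` helpers are hypothetical, chosen only to show the idea.

```python
from collections import Counter
from typing import List, Tuple

# (start, end, type) character offsets; assumed to come from an NER step.
Span = Tuple[int, int, str]


def redact(text: str, spans: List[Span], entity_type: str = "PERSON") -> str:
    """Replace every span of the given type with a placeholder.

    Spans are processed right to left so that earlier offsets stay valid
    even though the placeholder differs in length from the name.
    """
    for start, end, kind in sorted(spans, reverse=True):
        if kind == entity_type:
            text = text[:start] + "[REDACTED]" + text[end:]
    return text


def facet_counts(spans: List[Span]) -> Counter:
    """Count entities per type, e.g. to drive facets in a search interface."""
    return Counter(kind for _, _, kind in spans)


text = "Ada Lovelace met Charles Babbage in London."
spans = [(0, 12, "PERSON"), (17, 32, "PERSON"), (36, 42, "LOCATION")]

print(redact(text, spans))   # [REDACTED] met [REDACTED] in London.
print(facet_counts(spans))   # Counter({'PERSON': 2, 'LOCATION': 1})
```

Highlighting works the same way: instead of substituting a placeholder, wrap `text[start:end]` in whatever markup the display layer uses.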
Check out part 2 of this blog next week for information on implementing NER and RAVN’s approach.