There are multiple ways to go about implementing NER. A simple method is to keep a dictionary of words that belong to a certain type of entity (e.g. a list of all the countries in the world) and do simple string matching against a provided document. Despite its simplicity, this approach is highly effective and indeed forms one aspect of RAVN’s approach to entity extraction.
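As a rough illustration of the dictionary approach, here is a minimal Python sketch. The country list, the function names and the LOCATION label are all made up for the example and are not taken from RAVN’s implementation.

```python
import re

# Hypothetical, deliberately tiny dictionary; a real list would be far larger.
COUNTRIES = ["France", "Germany", "United Kingdom", "United States"]

def build_matcher(terms):
    # Try longer terms first so a multi-word name wins over any shorter term it contains.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(term) for term in ordered) + r")\b"
    return re.compile(pattern)

def find_entities(text, matcher, label):
    # Return (start, end, matched text, label) for every dictionary hit.
    return [(m.start(), m.end(), m.group(0), label) for m in matcher.finditer(text)]

text = "She moved from France to the United Kingdom."
matcher = build_matcher(COUNTRIES)
hits = find_entities(text, matcher, "LOCATION")
```

Anything that is not already in the list (a new company name, say) is simply never found, which is the limitation discussed next.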
However, generating these lists is very labour-intensive, and the lists often do not extend to the multiple problem domains we are faced with. Also, new names for people and organisations arise all the time, and we would like a system that can uncover them in a document without having to continuously update a set of dictionaries. One way to deal with this is to analyse word and sentence structure. For example, we could extract the names from a sentence such as “Barack Obama works at the White House in Washington DC” by formulating a rule that maps to such a sentence: <PERSON> works at <ORGANISATION> in <LOCATION>. The approach is effective, but its disadvantages include:
- A large number of rules would need to be formulated
- Different rule sets would need to be formulated for different languages
- Often it is difficult to form a rule that would ensure we extract everything that is relevant
- The formulation of rules is a very labour-intensive process and requires expert insight
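To make the rule-based idea concrete, the single rule above can be sketched as a regular expression in Python. The definition of a "name" used here (a run of capitalised words, optionally preceded by "the") is an assumption made purely for the example.

```python
import re

# Assumption: a name is a run of capitalised words, optionally preceded by "the".
NAME = r"(?:the\s+)?[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*"

# One hand-written rule: <PERSON> works at <ORGANISATION> in <LOCATION>
RULE = re.compile(
    rf"(?P<PERSON>{NAME})\s+works\s+at\s+(?P<ORGANISATION>{NAME})\s+in\s+(?P<LOCATION>{NAME})"
)

def apply_rule(sentence):
    # Return the named groups of the first match, or an empty dict if the rule does not fire.
    match = RULE.search(sentence)
    return match.groupdict() if match else {}

entities = apply_rule("Barack Obama works at the White House in Washington DC.")
missed = apply_rule("Barack Obama is employed by the White House.")
```

The second sentence carries much the same information, yet the rule does not fire and `missed` comes back empty. Covering such variations is what drives up the number of rules.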
RAVN’s Approach to NER
We lean heavily on machine learning approaches to data analysis at RAVN Systems. The fundamental concept of machine learning is that we learn from the data itself. We use some domain-specific knowledge to help choose the form of model (e.g. a type of neural network or a particular statistical model) and to drive the design of the features that are extracted from the data and fed into the model. Once these have been decided upon, pre-labelled training data is supplied to train the model. The trained model can then be used to label previously unseen data.
With regard to NER, we use a state-of-the-art statistical Conditional Random Field model, which is global in scope in that the probability of a word belonging to a certain type of entity is conditional on all the words surrounding it. We use a wide range of textual features, starting with yes/no features such as:
- Is the word capitalised?
- Does the word contain a number?
- Does the word contain punctuation?
These yes/no features are encapsulated as word shape features. Further features include word n-grams and context (i.e. previous and subsequent words). The selection of features is a crucial factor affecting the performance of the algorithm in labelling previously unseen data.
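As an illustration of what such features might look like in practice, here is a minimal Python sketch that turns one token into a feature dictionary. The feature names, the shape encoding (X for upper case, x for lower case, d for digit) and the boundary markers are assumptions made for the example, not RAVN’s actual feature set.

```python
import re
import string

def word_shape(word):
    # Collapse characters into classes: upper -> X, lower -> x, digit -> d; keep the rest.
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)

def token_features(tokens, i):
    # Features for the token at position i, including its immediate context.
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else "<START>"
    next_word = tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>"
    return {
        "is_capitalised": word[:1].isupper(),
        "has_digit": any(c.isdigit() for c in word),
        "has_punct": any(c in string.punctuation for c in word),
        "shape": word_shape(word),
        "prev_word": prev_word,
        "next_word": next_word,
        "bigram_prev": prev_word + " " + word.lower(),  # word bigram ending at this token
    }

tokens = ["Barack", "Obama", "works", "at", "the", "White", "House"]
features = [token_features(tokens, i) for i in range(len(tokens))]
```

One dictionary per token, paired with that token’s label in the training data, is the sort of input a sequence model such as a CRF is typically trained on.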
Training data itself is also a critical factor affecting performance. In our case we have built a model using a standard news data set(*) that is pre-labelled with three types of entity: person, organisation and location. The resulting performance is optimised for news data, although we have also seen good performance when applying the model to a range of other use cases, including the analysis of video transcripts of interviews and the biographies and résumés of a company workforce. The advantages of this machine learning approach include:
- We have a model that performs well for news data
- Updating the model for a different domain only requires introducing new training data specifically targeted at that domain. This is a one-time, upfront manual process but does not typically require the presence of an expert
- The approach works for multiple languages provided training data for the specific language of interest is supplied
The RAVN NER Server encompasses all our work on NER, and there is a dedicated stage in RAVN Pipeline that calls out to it to enrich document metadata as each document is crawled by the Pipeline. In its raw form, the RAVN NER Server receives text in an HTTP POST request and returns JSON-formatted entities (see Figure 2).
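The request and response cycle described above could be driven from a client roughly as follows. This is only a sketch: the URL, the content type and the function names are hypothetical, and no assumption is made about the structure of the returned JSON beyond it being valid JSON.

```python
import json
import urllib.request

# Hypothetical endpoint: the real server address and path are not given in this article.
NER_URL = "http://localhost:8080/ner"

def build_request(text, url=NER_URL):
    # Put the raw document text in the body of an HTTP POST request.
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},  # assumed content type
        method="POST",
    )

def extract_entities(text, url=NER_URL, timeout=10):
    # Performs the network call, so it only works against a running server.
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

request = build_request("Barack Obama works at the White House in Washington DC.")
```

Calling `extract_entities(...)` against a live server would return the parsed JSON, which a pipeline stage could then attach to the document as metadata.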
We showcase this technology with a real-life use case in which we crawled a law firm’s publicly available employee database containing biographies. The crawling was done using RAVN Pipeline’s web connector, and Pipeline was configured to talk to the RAVN NER Server to enrich the data as it was fetched. Pipeline parses HTML tags to generate useful attributes (e.g. the languages an employee speaks), and the NER stage extracts the customers and other interesting entities embedded within the raw biography text (see Figure 3). This is highly valuable for automatically forming connections within the company workforce, and it significantly improves retrieval of employees for a specific task (for example, finding all employees who may have previously represented a certain customer).
We are always looking to extend the usability and performance of RAVN NER by improving the core algorithms and introducing different domain-specific data sets. If RAVN NER looks like it could solve a problem or help enrich metadata at your company, be sure to get in touch with us.
* CoNLL2003: http://www.cnts.ua.ac.be/conll2003/ner/