
RECAM Project: Significant improvement in text mining

September 16, 2019

From 2017 until the end of August 2019, deecoob worked successfully on the research project "RECAM", a cooperation between deecoob Technology GmbH and the Forensic Science Investigation Lab at the University of Applied Sciences Mittweida (HSMW). The abbreviation "RECAM" stands for retrospective event monitoring for the computer-forensic investigation of copyright misuse on the basis of publicly available digital media. The goal was to develop a technology for the automated, computer-forensic investigation of the use and misuse of copyrights at music events. The application uses web crawlers and full-text indexers and provides interfaces to Facebook, Instagram, Twitter, ePapers, and web pages in order to analyze, process, and qualify textual data from publicly available sources.

In order to control and enforce copyrights and exploitation rights at commercial music events, a wide variety of sources (social media, ePapers, portals, websites ...) must be searched, and the findings extracted and classified according to their relevance to the respective legal claims. This is still a highly manual and thus cost-intensive and error-prone process, in which automated text retrieval can currently form only a preliminary stage of document classification and evaluation. According to the current state of the art, conventional text-mining methods such as bag-of-words, tf-idf, or standard NLP pipelines deliver insufficient results if the outlined process is to be fully automated. During the RECAM research project we therefore developed solutions based on several complementary technologies; in combination, they form the functional, vector-space-based retrieval and classification system delivered at the end of the project, with which the manual work described above can be automated as far as possible.
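
For reference, a minimal tf-idf baseline of the kind such conventional methods provide could look as follows. This is only an illustrative sketch using scikit-learn; the example texts and labels are invented and are not project data:

```python
# Minimal tf-idf baseline for event-text classification (illustrative sketch).
# The documents and labels below are invented examples, not project data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Live concert with DJ support this Saturday at Club X, doors 8 pm",
    "Lecture on medieval history at the town library",
    "Open-air festival: three bands, food trucks, admission 10 EUR",
    "City council meeting agenda for September",
]
train_labels = [1, 0, 1, 0]  # 1 = relevant music event, 0 = not relevant

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_texts, train_labels)

print(clf.predict(["Jazz night with live band at the Old Brewery"]))  # -> [1]
```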

At the beginning of the project, the manual approach taken by administrators when researching data for the license verification of music events, or of the commercial use of music in public spaces, was analyzed in detail. The aim of this analysis was both to gain a deeper understanding of the data quality of the different sources and to learn what to search for in these sources so that licenses can be checked later. The analysis showed that the sources can be divided, on the one hand, into structured and unstructured information components and, on the other hand, into obligatory components (necessary information, e.g. the venue) and optional components (additional information, e.g. the admission price).

For the task of automatically checking licenses for this kind of commercial music use, the information in such advertising texts must be a) identified and extracted, b) interpreted, and c) resolved and enriched. As a result, previously unstructured data becomes available as structured data through AI automation and can be compared automatically with an existing license database. The result of this analysis is a schema definition that describes the target structure into which data from any source must be transferred.
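
Such a target schema could be sketched, for illustration, as a typed record with obligatory and optional fields. The field names below are assumptions for this example, not the project's actual schema:

```python
# Illustrative target schema: obligatory vs. optional event properties.
# Field names are assumed for this sketch; the actual RECAM schema may differ.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MusicEventRecord:
    # Obligatory properties (required for a license check)
    venue: str
    event_date: str          # ISO 8601 date, e.g. "2019-08-31"
    artists: List[str]
    source_url: str
    # Optional properties (additional information)
    admission_price: Optional[float] = None
    organizer: Optional[str] = None
    genres: List[str] = field(default_factory=list)
```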

In the second project step, a format was developed to describe the mandatory meta-properties. This is the prerequisite both for the automated resolution of the meta-properties mentioned above and for the enrichment of information contained in a text. Furthermore, these meta-properties can significantly improve text classification. Within the framework of the project, an NLP (Natural Language Processing) method was developed and tested that automatically identifies named entities in texts (Named Entity Recognition). The identified terms are checked to determine whether they are names of artists or bands, or whether they refer to names of venues, and the appropriate property is added to the data record as a meta tag. To determine these properties, external knowledge bases such as MusicBrainz, DBpedia, and others were connected and successfully tested. In text classification, the meta-properties improve accuracy because ambiguities can be resolved.
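
As an illustration of this step, the following sketch tags named entities and checks them against MusicBrainz. spaCy and the musicbrainzngs client merely stand in for the project's actual NLP stack, which is not specified in detail here; the user-agent contact is a placeholder:

```python
# Sketch: tag named entities and resolve them against MusicBrainz.
# spaCy and musicbrainzngs stand in for the project's actual, unspecified stack.
import spacy
import musicbrainzngs

nlp = spacy.load("en_core_web_sm")  # small English model
musicbrainzngs.set_useragent("recam-demo", "0.1", "contact@example.com")

def extract_meta_tags(text):
    """Return candidate (entity, meta_tag) pairs for an advertising text."""
    tags = []
    for ent in nlp(text).ents:
        if ent.label_ in ("PERSON", "ORG"):
            # Ask MusicBrainz whether the entity is a known artist or band.
            result = musicbrainzngs.search_artists(artist=ent.text, limit=1)
            if result["artist-list"]:
                tags.append((ent.text, "artist"))
        elif ent.label_ in ("FAC", "GPE", "LOC"):
            tags.append((ent.text, "venue_candidate"))
    return tags

print(extract_meta_tags("The Rolling Stones play at Waldbühne Berlin on Saturday."))
```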

In the further course of the project, a test environment was developed to prototype the complete text-classification process, from model creation (selection of training data) through the learning method (active learning) to the model test (gold standard). Based on the previously defined structure, an active-learning procedure was developed for compiling training and test data in order to determine the relevance of an event advertising text via text classification, by determining two meta-properties: event and music. A study with 35 participants was carried out together with the HSMW, and a scientific paper ("Towards an Automated System for Music Event Detection", see the corresponding interim report of HS Mittweida) was subsequently published by the HSMW.
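
One common way to realize such an active-learning loop is uncertainty sampling: train on the labeled pool, then hand the texts the classifier is least certain about to the annotators. The following sketch illustrates this idea and is not necessarily the project's exact procedure:

```python
# Sketch of an uncertainty-sampling active-learning round (illustrative,
# not necessarily the project's exact method).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def active_learning_round(labeled_texts, labels, unlabeled_texts, batch_size=10):
    """Train on the labeled pool and pick the most uncertain unlabeled texts."""
    vec = TfidfVectorizer()
    X_lab = vec.fit_transform(labeled_texts)
    clf = LogisticRegression().fit(X_lab, labels)

    X_unl = vec.transform(unlabeled_texts)
    proba = clf.predict_proba(X_unl)[:, 1]
    uncertainty = np.abs(proba - 0.5)        # closest to 0.5 = least certain
    query_idx = np.argsort(uncertainty)[:batch_size]
    return [unlabeled_texts[i] for i in query_idx]  # hand these to annotators
```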

By annotating the properties "event" and "music" for 1,018 randomly selected text documents, it could be shown that selecting training and test data on the basis of a majority vote among the test subjects can lead to a significant increase in the quality of a text classifier. The results of this study show that the described procedure for annotating training and test data, together with the determination of the previously analyzed meta-properties, has a positive effect on the accuracy of the model. Furthermore, determining a single meta-property is far less complex than assessing overall relevance. An overall decision about the relevance of a data set can thus be broken down into its partial aspects (music yes/no, event yes/no, etc.), which in turn can be represented in the defined structure described above.
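
The decomposition can be illustrated as follows: several binary classifiers each decide one partial aspect, and an overall relevance decision combines them. The aspect names follow the study; combining them with a plain conjunction is an assumption made for this sketch:

```python
# Sketch: overall relevance as a conjunction of per-aspect decisions.
# Aspect names ("music", "event") follow the study; combining them with a
# plain AND is an assumption made for this illustration.
def is_relevant(text, aspect_classifiers):
    """aspect_classifiers maps an aspect name to a predict(text) -> bool."""
    decisions = {name: clf(text) for name, clf in aspect_classifiers.items()}
    return all(decisions.values()), decisions

# Usage with trivial stand-in classifiers:
classifiers = {
    "music": lambda t: "band" in t.lower() or "concert" in t.lower(),
    "event": lambda t: "tonight" in t.lower() or "saturday" in t.lower(),
}
print(is_relevant("Rock concert this Saturday at the harbor", classifiers))
# -> (True, {'music': True, 'event': True})
```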

The classification pipeline was implemented as a hybrid cloud architecture. The data is stored in deecoob's own infrastructure, using an Elasticsearch cluster. The application for displaying and labeling the training data (active learning) as well as the computations for feature extraction and selection and the various classifiers were realized as cloud services in the form of Docker containers. It was demonstrated prototypically that the computation-intensive components of the system (clusterers, classifiers) can be scaled to the desired processing speed of more than 1,000,000 documents per week. The transferability of the project results and of the prototype was also verified for the application domain "film screenings": the same test results with regard to recall, precision, and data throughput were achieved as for the original domain "music events", on which the project development was initially based.
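
A minimal sketch of how classified documents might be stored in and retrieved from such an Elasticsearch cluster is shown below; the host, index name, and document fields are assumptions for this example:

```python
# Sketch: storing classified documents in an Elasticsearch cluster.
# Host, index name, and document fields are assumptions for this example.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # stand-in for the internal cluster

doc = {
    "text": "Open-air festival: three bands, admission 10 EUR",
    "meta": {"music": True, "event": True},
    "label": "relevant",
}
es.index(index="recam-documents", document=doc)

# Retrieve relevant documents for the license check:
hits = es.search(index="recam-documents",
                 query={"term": {"label": "relevant"}})
print(hits["hits"]["total"])
```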

The project went extremely well for deecoob. The cooperation with the project partner HSMW was smooth and ultimately decisive for the project's success, and it will be continued in further projects! The goals originally set with regard to the recall and precision of the system as well as its scalability and performance in the cloud were all achieved. In initial discussions with customers, a great deal of interest in the technology was noted, especially in international markets. Test runs of the developed technology at potential customers will be carried out directly after the project's conclusion.

We are very grateful that the research project was funded by the Central Innovation Program for SMEs of the Federal Ministry of Economics and Energy.

