Autonomous Knowledge Extractor

Open data API in a single place

Provided by the Ministry of Administration and Digitization of Poland

Get early access to Autonomous Knowledge Extractor API!

Let us know and we will figure it out for you.

Dataset information

Country of origin
Updated
2023.03.10 13:24
Created
2023.03.10
Available languages
Polish
Keywords
NLP, entity extraction from documents, semantic tagging, extraction of simple scalar attributes, ontology extraction, topic modeling, data processing, data extraction, document pre-processing
Quality scoring
295

Dataset description

Industrial research: Task No. 1 - Development of algorithms for extracting objects from data

This task comprises industrial research on algorithms for extracting objects from data. The basic assumption of the semantic web is that it operates on objects with specific attributes and relations between them. Since input data typically have a weak structure (textual or structured documents with only general attributes such as title or creator), methods are needed to extract objects of basic types representing typical real-world concepts: people, institutions, places, dates, and so on. Tasks of this type are handled by algorithms from the field of natural language processing and entity extraction. The main technological goal of this stage was an algorithm that extracts entities from weakly structured documents (in the extreme case, plain text) as efficiently as possible. This required processing documents into a shared internal representation and extracting entities in a generalized way, regardless of the source form of the document.

Detailed tasks (milestones):

- Development of an algorithm for pre-processing data and an internal representation of a generalized document. Methods will be selected for pre-processing documents from various sources and in various formats into a common form on which further algorithms operate. Inputs include text documents (PDF, Word, etc.), scans of printed documents (handwriting excluded), web documents (HTML pages), other databases (relational tables), CSV/XLS files, and XML files.
- Development of an algorithm for extracting simple attributes from documents: extraction of simple scalar attributes, such as dates and numbers, from processed documents, taking into account metadata existing in source systems and document templates for groups of documents with a similar structure.
- Development of an entity extraction algorithm for basic classes of objects: entity extraction from unstructured text documents using NLP techniques, based on a language corpus developed for Polish and English and extensible to other languages, covering the basic types of real-world objects (places, people, institutions, events, etc.).

Industrial research: Task No. 2 - Development of algorithms for automatic ontology creation

Task 2 covers the development of algorithms for automatic ontology creation. Reducing the impact of the human factor on data organization requires algorithms that largely automate the classification and organization of data imported into the system. This calls for advanced knowledge modeling techniques such as ontology extraction and topic modeling. Such algorithms are usually based on text statistics, and the quality of their output depends heavily on the quality of the input data. This creates the risk that the models produced by the algorithms may differ from the expert models used by domain specialists; the architecture of the solution must therefore take this risk into account.

Detailed tasks (milestones):

- Development of an algorithm for organizing objects in dictionaries and deduplicating dictionary entities: the algorithm organizes objects identified by the earlier algorithms so as to prevent duplication of objects representing the same concept and to enable the presentation of appropriate relationships between nodes of the semantic network.
- Development of an extraction algorithm for a domain ontological model: requires sophisticated analysis of the accumulated document corpus to identify domain-specific concepts and objects. The task will be carried out by a research unit experienced in creating ontological models.
- Development of a semantic tagging algorithm: requires topic modeling methods. The task will be carried out by a research unit experienced in creating ontological models.
- Development of a method for representing the semantic model in a database: encoding the information produced by the previous algorithms so that it can be stored in a scalable manner in an appropriate database.

Experimental development work: Task No. 3 - Prototype of the system

The purpose of this task was to create an application prototype that validates both feasibility at a realistic scale (millions of documents) and functional usability for the end user. Semantic modeling researchers often work with theoretical models expressed in languages that are optimal for mathematical modeling but do not scale to production use, so an architecture had to be developed that scales the algorithms to large data sets. Another aspect of semantic solutions is usability for end users: these solutions are based on advanced concepts, which forces a complex internal structure and complicated access to data. To keep the project usable, a user interface had to be developed that offers advanced data operations to ordinary users.

Detailed tasks (milestones):

- Development of methods for obtaining data from various sources: an architecture and pipelines for processing data obtained from heterogeneous sources and formats into a coherent central knowledge repository, using an ETL/ESB-style architecture based on a queuing system and distributed processing.
- Development of a large-scale data processing architecture: an implementation architecture that runs the developed algorithms at scale, e.g. on distributed processing systems such as Apache Spark.
- Development of scalable data storage methods: selection of a storage environment that effectively represents knowledge as a semantic network; a graph database engine or a database supporting the RDF format will be required.
- Development of an API enabling data mining: an API that exposes the semantic knowledge accumulated in the system to algorithms for further data processing, machine learning, and artificial intelligence; a likely solution is an interface based on the SPARQL standard.
- Development of a prototype user interface for data mining: an ergonomic interface that lets domain users explore and analyze the collected data. It requires a method of generating an interface that automatically adapts to the type of data collected in the system, enabling exploration through "Query By Example" queries, faceted search, and traversal of relationships between entities in the semantic model.
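The milestone on extracting simple scalar attributes (dates and numbers) can be illustrated with a minimal, self-contained sketch. The regular expressions and sample text below are illustrative assumptions, not part of the project's actual implementation:

```python
import re
from datetime import date

# Hypothetical sketch: pull ISO-style dates and standalone numbers out of
# weakly structured text - the "simple scalar attributes" the task describes.
DATE_RE = re.compile(r"\b(\d{4})[.-](\d{2})[.-](\d{2})\b")
NUMBER_RE = re.compile(r"(?<!\d)\d+(?:\.\d+)?")

def extract_scalars(text):
    """Return detected dates and standalone numbers from a text fragment."""
    dates = [date(int(y), int(m), int(d)) for y, m, d in DATE_RE.findall(text)]
    # Remove date matches first so their digits are not re-read as numbers.
    cleaned = DATE_RE.sub(" ", text)
    numbers = [float(n) for n in NUMBER_RE.findall(cleaned)]
    return {"dates": dates, "numbers": numbers}

sample = "Umowa z dnia 2023-03-10 na kwote 1500.50 PLN, pozycji: 12."
result = extract_scalars(sample)
```

A production version would additionally consult source-system metadata and per-template rules, as the milestone notes.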
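The dictionary deduplication milestone - one node per real-world concept - can be sketched with simple string normalization plus fuzzy matching. The threshold and the greedy grouping strategy are assumptions for illustration only:

```python
from difflib import SequenceMatcher

def normalize(name):
    # Lowercase, drop periods, collapse whitespace.
    return " ".join(name.lower().replace(".", " ").split())

def deduplicate(names, threshold=0.85):
    """Group entity names whose normalized forms are near-duplicates."""
    canonical = []  # list of (normalized key, [surface variants])
    for name in names:
        key = normalize(name)
        for existing_key, variants in canonical:
            if SequenceMatcher(None, key, existing_key).ratio() >= threshold:
                variants.append(name)
                break
        else:
            canonical.append((key, [name]))
    return [variants for _, variants in canonical]

groups = deduplicate([
    "Ministry of Administration and Digitization",
    "ministry of administration and digitization",
    "Apache Spark",
])
```

Each resulting group would map to a single node in the semantic network, with the variants kept as alternative labels.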
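For the milestone on representing the semantic model in a database, one natural encoding (given the description's mention of RDF support) is subject-predicate-object triples serialized as N-Triples, which RDF stores can load directly. The URIs below are invented examples, not identifiers from the project:

```python
def to_ntriples(triples):
    """Serialize (subject, predicate, object) triples to N-Triples lines."""
    lines = []
    for s, p, o in triples:
        # Treat objects that look like URIs as resources, others as literals.
        obj = f"<{o}>" if o.startswith("http") else '"' + o.replace('"', '\\"') + '"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

model = [
    ("http://example.org/doc/1", "http://purl.org/dc/terms/title",
     "Annual report"),
    ("http://example.org/doc/1", "http://example.org/rel/mentions",
     "http://example.org/entity/warsaw"),
]
ntriples = to_ntriples(model)
```

A real pipeline would use an RDF library and a graph database rather than hand-built strings, but the shape of the stored knowledge is the same.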
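The SPARQL-based data mining API could be consumed as in the following sketch. The endpoint URL and the URIs in the query are placeholder assumptions; the request shape follows the standard SPARQL protocol over HTTP:

```python
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:3030/knowledge/sparql"  # assumed endpoint

# Ask for all entities mentioned in one document of the semantic network.
query = """
SELECT ?entity WHERE {
  <http://example.org/doc/1> <http://example.org/rel/mentions> ?entity .
}
"""

def build_request(endpoint, sparql):
    """Build a POST request carrying a SPARQL query, expecting JSON results."""
    data = urllib.parse.urlencode({"query": sparql}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )

req = build_request(ENDPOINT, query)
# urllib.request.urlopen(req) would return the result bindings as JSON.
```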
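Finally, the "Query By Example" and faceted-search behavior planned for the UI prototype can be shown in miniature: the user supplies only the fields they care about, and matching records come back together with facet counts for further narrowing. The records and field names are invented sample data:

```python
from collections import Counter

DOCS = [  # invented sample records
    {"type": "contract", "place": "Warsaw", "year": 2021},
    {"type": "contract", "place": "Krakow", "year": 2022},
    {"type": "report", "place": "Warsaw", "year": 2022},
]

def query_by_example(docs, example):
    """Return records matching every field of the example, plus facet counts."""
    hits = [d for d in docs if all(d.get(k) == v for k, v in example.items())]
    facets = {field: Counter(d[field] for d in hits)
              for field in ("type", "place", "year")}
    return hits, facets

hits, facets = query_by_example(DOCS, {"place": "Warsaw"})
```

In the real system the facet fields would be derived automatically from the semantic model rather than hard-coded.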
Build on reliable and scalable technology
Revolgy · Amazon Web Services · Google Cloud
FAQ

Frequently Asked Questions

Some basic information about API Store®.

Operation and development of the APIs are currently fully funded by Apitalks, and their use is free of charge.
All important information, such as the time of the last update and the license, is included in the response of each API call.
In the case of a major update that is not compatible with the previous version of the API, we keep both versions for 30 days, so you will have enough time to migrate to the new version. We will inform you about such changes in advance by e-mail.

Didn't find the API you need?

Let us know and we will figure it out for you.

API Store provides access to European Open Data via a scalable and reliable REST API interface.
Copyright © 2024. Made with ♥ by Apitalks