Målfrid 2021 - Freely Available Documents from Norwegian State Institutions

Open data API in a single place

Provided by difi

Dataset information

Country of origin:
Updated: 2021.04.30 00:00
Created: 2020.12.01
Available languages: Norwegian
Keywords: korpus, språkforskning, språkteknologi, tekst, språkbanken
Quality scoring: 245

Dataset description

This corpus consists of documents from 339 internet domains of Norwegian state institutions and comprises approximately 4.1 billion tokens in total, which makes it one of the largest freely available resources for Norwegian Bokmål and Nynorsk. In addition to Norwegian, the corpus contains texts in Northern Sami, Lule Sami, Southern Sami and English.

The data were collected as part of the Målfrid project, in which the National Library of Norway, on behalf of the Ministry of Culture and in collaboration with the Language Council of Norway, collects and aggregates data for mapping the usage of Norwegian Bokmål and Norwegian Nynorsk in Norwegian state institutions.

The corpus is the result of a focused crawl conducted between December 11th 2020 and January 18th 2021, recursively downloading text documents (HTML, DOC(X)/ODT and PDF) from a set of domains (down to and including level 12), while obeying robots.txt and politeness restrictions.

The crawled documents were further processed according to their format: natural language was extracted from HTML using the boilerplate removal system Justext (http://corpus.tools/wiki/Justext), from the Word/ODT documents using Textract (https://textract.readthedocs.io/en/stable/) and from the PDFs using Google Cloud Vision OCR. The language of the extracted text was identified at document level using TextCat (cf. https://www.let.rug.nl/~vannoord/TextCat/), and the result is provided as part of the metadata. The documents were deduplicated at domain level (exact duplicates only).

The corpus is provided as gzipped JSON lines (jsonl), one document per line. There is one JSONL file per combination of domain, language and content type. The files are encoded as UTF-8, with ASCII escape sequences.
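
The file format described above can be read with the Python standard library alone. The following is a minimal sketch, assuming a corpus file has already been downloaded locally; the function name iter_documents and any file name passed to it are hypothetical and not part of the dataset.

```python
import gzip
import json
from typing import Iterator


def iter_documents(path: str) -> Iterator[dict]:
    """Yield one parsed document per non-empty line of a gzipped JSONL file."""
    # Text mode with UTF-8 decoding; json.loads itself turns ASCII escape
    # sequences such as \u00e5 back into the original characters.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)
```

Because the function is a generator, documents are parsed one at a time, which may matter given the overall size of the corpus.
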
Each dictionary contains the following keys:
- lang: language of the document (detected using TextCat)
- url: the url of the document at crawl time
- date: crawl date
- mimetype: media type of the document (simplified): HTML, DOC or PDF
- fulltext: an array of strings, where each string represents one paragraph. An empty string denotes a new page in the PDF documents.
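
As an illustration of the fulltext convention, the sketch below groups the paragraphs of a PDF document into pages. It assumes that each empty string acts as a separator between two pages; the function name split_pages is hypothetical, and the sample values in the usage note are invented for illustration.

```python
from typing import List


def split_pages(fulltext: List[str]) -> List[List[str]]:
    """Group paragraphs into pages, treating each empty string as a page break."""
    pages: List[List[str]] = [[]]
    for paragraph in fulltext:
        if paragraph == "":
            # Assumption: an empty string starts a new page.
            pages.append([])
        else:
            pages[-1].append(paragraph)
    return pages
```

For example, split_pages(["a", "b", "", "c"]) returns [["a", "b"], ["c"]]. If a document happens to begin with an empty string, the first page in the result will be empty under this interpretation.
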
FAQ

Frequently Asked Questions

Some basic information about API Store ®.

The operation and development of the APIs are currently fully funded by Apitalks, and their use is free of charge.
Yes, you can.
All important information, such as the time of the last update and the license, is included in the response of each API call.
In the case of a major update that is not compatible with the previous version of an API, we keep both versions available for 30 days so that you have enough time to migrate to the new version. We will inform you about the changes in advance by e-mail.

API Store provides access to European Open Data via a scalable and reliable REST API.
Copyright © 2024. Made with ♥ by Apitalks