Norwegian Parliamentary Speech Corpus

Open data API in a single place

Provided by difi

Get early access to Norwegian Parliamentary Speech Corpus API!

Let us know and we will figure it out for you.

Dataset information

Catalog

data.norge.no

Country of origin

Norway

Updated

2021.11.30 00:00

Created

2019.08.01

Available languages

Norwegian

Keywords

språkforskning, språkteknologi, språkbanken, korpus, tale

Datasource

Official portal for European data

Quality scoring

245

Dataset description

This is version 1.1 of the Norwegian Parliamentary Speech Corpus (NPSC). The following changes were made in the update from version 1.0 to 1.1:

- The data was split into official training, evaluation and test sets.
- Manual dialect annotations were added for each speaker.
- The end time of one sentence in 20171208 (sentence_id 45886) was changed, as a 30-minute break was included in the sentence time span in version 1.0. The corresponding audio file (20171208-085509_6122400_6124160.wav) was shortened accordingly.
- Some of the metadata in the transcriptions of 20171213 was missing from the JSON transcription files. It has been added in version 1.1.
- The documentation was updated to reflect these changes.

The corpus was developed by the Norwegian Language Bank at the National Library of Norway from 2019 to 2021. The NPSC consists of audio recordings of meetings in Stortinget (the Norwegian parliament) and corresponding orthographic transcriptions in either Norwegian Bokmål or Norwegian Nynorsk, as well as various metadata about the speakers. The official proceedings from the meetings are also included in the corpus for reference.

The recordings amount to 140 hours of running speech (including pauses) from 267 unique speakers, and contain 65,000 sentences and 1.2 million words. Transcription was first done automatically; the output of the automatic process was then manually checked and corrected by trained linguists and philologists. Finally, all transcriptions were proofread to ensure consistency and accuracy.

NPSC is primarily intended as an open-source dataset for ASR development. The audio files in the corpus contain the speech of entire days of plenary meetings from 2017 and 2018 (or, if a meeting lasts more than six hours, the first six hours of a day). Since the audio files are quite large, individual audio files for each sentence are also included. Beta releases of the NPSC were published in 2020 and 2021.

Note that we have run postprocessing scripts since the last release (0.2) which affect all transcriptions, and the formatting of the transcriptions is different from previous releases. Users should therefore replace old transcription files with the files in this release. We greatly appreciate any feedback and suggestions for improvement. Please use our e-mail address, [email protected].
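The description mentions one sentence-level audio file by name, 20171208-085509_6122400_6124160.wav. The following Python sketch shows how such a name could be split into its parts. It rests on assumptions that the description does not confirm: that the first two fields are the recording date and a time of day, and that the last two numbers are the sentence's start and end offsets in milliseconds. The names SentenceClip and parse_clip_name are hypothetical helpers, not part of the corpus tooling.

```python
import re
from datetime import datetime
from typing import NamedTuple


class SentenceClip(NamedTuple):
    """Fields recovered from a sentence-level audio file name (hypothetical type)."""
    recorded_at: datetime  # ASSUMPTION: date and time of day of the recording
    start_ms: int          # ASSUMPTION: sentence start offset in milliseconds
    end_ms: int            # ASSUMPTION: sentence end offset in milliseconds

    @property
    def duration_s(self) -> float:
        """Length of the clip in seconds, under the millisecond assumption."""
        return (self.end_ms - self.start_ms) / 1000


# Pattern inferred from the single file name quoted in the description:
# <YYYYMMDD>-<HHMMSS>_<start>_<end>.wav
_CLIP_NAME = re.compile(r"^(\d{8})-(\d{6})_(\d+)_(\d+)\.wav$")


def parse_clip_name(filename: str) -> SentenceClip:
    """Split a clip file name into its timestamp and offset fields."""
    match = _CLIP_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected clip file name: {filename!r}")
    date, time, start, end = match.groups()
    recorded_at = datetime.strptime(date + time, "%Y%m%d%H%M%S")
    return SentenceClip(recorded_at, int(start), int(end))


clip = parse_clip_name("20171208-085509_6122400_6124160.wav")
print(clip.recorded_at.isoformat(), clip.start_ms, clip.end_ms, clip.duration_s)
```

If the millisecond assumption holds, the offsets of the quoted file differ by 1760, which would correspond to a clip of 1.76 seconds. Check the corpus documentation before relying on this layout.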

Build on reliable and scalable technology

FAQ

Frequently Asked Questions

Some basic information about API Store ®.

The operation and development of the APIs are currently fully funded by the company Apitalks, and their use is free.

Yes, you can.

All important information, such as the time of the last update and the license, is included in the response of each API call.

In case of a major update that is not compatible with the previous version of an API, we keep both versions available for 30 days so that you have enough time to migrate to the new version. We will inform you about the changes in advance by e-mail.

API Store ®

API Store provides access to European Open Data via a scalable and reliable REST API.