Spanish web archive: Old collection

Open data API in a single place

Provided by Ministerio de Hacienda y Administraciones Públicas

Get early access to Spanish web archive: Old collection API!

Let us know and we will figure it out for you.

Dataset information

Catalog: datos.gob.es

Country of origin: Spain

Updated: 2021.03.29 22:00

Created:

Available languages: Spanish

Keywords:

Datasource: Official portal for European data link

Quality scoring: 200

Dataset description

The collection of web pages is the main way to carry out the legal deposit of online publications. It is carried out with crawling robots that go through previously selected URLs and save everything linked from them, with a set frequency, depth and size. The results of these web collections are web archives. Today it is impossible to aspire to completeness in web archiving, so the National Library of Spain has opted for a mixed model that combines massive and selective collections:

1. Massive collections collect as many domains as possible with a small depth in navigation levels and are linked to the .es domain. They are made once a year.
2. Selective collections are made to complement the massive collections: they collect a smaller sample of websites, selected for their relevance to history, society and culture, more deeply and more frequently. They are carried out several times a year in collaboration with the conservation centers of the autonomous communities and other specialised institutions. These selective collections can be of three types:

    2.1. Thematic: each Department of the National Library and each autonomous community maintains its own thematic collections with the online resources it deems necessary to keep as part of the legal deposit. For example: Music and Audiovisuals, Andalusian Electronic Magazines, Institutions of the Valencian Community, etc.

    2.2. Event: on events of special relevance.

    2.3. Emergency: for websites at risk of disappearing.

**Downloadable file fields:**

* Website title
* Seed: the URL provided as the starting point for collection. It can represent the home page of a site, a section of a site, or a document in another format contained on a web page.
* Additional URLs: extra URLs that can be added to improve crawl coverage or quality (e.g. the site map, an important section, etc.).
* Status: "Active" if the website should be collected, or "Inactive" if collection should stop, for example because the website has ceased to exist.
* Frequency: how often the website should be collected. Frequencies can be Daily, Monthly, Fortnightly ("Quincenal"), and Unique (if the website should be collected only once).
* Depth: how deeply the website should be collected, that is, how far the robot will descend by following the links contained in the seed URL. The depth can be:
    * Home: collects only the URL given as the seed.
    * Start and 1 level: collects the URL given as the seed plus one level of depth.
    * Start and 2 levels: collects the URL given as the seed plus two levels of depth.
    * Domain: collects all URLs containing the proposed domain. For example, from the seed www.bne.es, it collects all URLs containing "bne.es".
    * Host: collects all URLs containing the proposed host. For example, from the seed www.bne.es, it collects all URLs that have www.bne.es.
    * Route: collects only the URLs under the given path and does not go back to URLs in previous directories.
* Size:
    * Small: for websites of up to 10,000 URLs.
    * Medium: for websites of up to 50,000 URLs.
    * Large: for websites of up to 100,000 URLs.
* Keywords: they describe the content of the resource more precisely and allow subcollections to be created within a collection. Between 1 and 5 keywords are assigned per record, separated by "/".
* Material: the materials of each collection make it possible to distinguish the different subcollections that the Autonomous Communities have. An abbreviated UDC (Universal Decimal Classification, "CDU" in Spanish) code and its literal description are assigned.

Contact: [email protected]

How to cite the set: Title of the data set. [Data set]. Version of DDMMYYYY. datos.gob.es. Dataset URL

E.g. Archive of the Spanish Web: Autonomous Community of Aragon. [Data set]. January 2019 version. datos.gob.es. https://datos.gob.es/es/catalogo/ea0019768-archivo-de-la-web-espanola-comunidad-autonoma-de-aragon
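The Depth, Size and Keywords fields described above can be illustrated with a minimal Python sketch. It is not the crawler's actual logic: the function names (`parse_keywords`, `registered_domain`, `in_scope`) are hypothetical, seeds are assumed to be full URLs including the scheme, "containing the domain" is interpreted as a match on the host name, and the domain is guessed naively from the last two labels of the host. Only the scope-based Depth values are covered, since the level-based ones depend on counting link hops during the crawl.

```python
from urllib.parse import urlsplit

# Size labels and their URL caps, as listed in the field description.
SIZE_LIMITS = {"Small": 10_000, "Medium": 50_000, "Large": 100_000}


def parse_keywords(raw: str) -> list[str]:
    """Split a '/'-separated Keywords value into trimmed, non-empty terms."""
    return [term.strip() for term in raw.split("/") if term.strip()]


def registered_domain(host: str) -> str:
    """Naive guess: keep the last two labels of the host (www.bne.es -> bne.es).

    Assumption for illustration only; a real crawler would consult the
    Public Suffix List instead.
    """
    return ".".join(host.lower().split(".")[-2:])


def in_scope(seed: str, url: str, depth: str) -> bool:
    """Return True if `url` lies inside the scope `depth` defines for `seed`.

    Handles only the scope-based Depth values: Home, Host, Domain, Route.
    """
    s, u = urlsplit(seed), urlsplit(url)
    s_host, u_host = (s.hostname or ""), (u.hostname or "")
    if depth == "Home":
        # Only the seed URL itself (plain string comparison).
        return url == seed
    if depth == "Host":
        return u_host == s_host
    if depth == "Domain":
        dom = registered_domain(s_host)
        return u_host == dom or u_host.endswith("." + dom)
    if depth == "Route":
        # Stay at or below the seed's directory; never climb to parent
        # directories. A seed path without a trailing slash is treated as
        # a file, so its parent directory becomes the base.
        base = s.path if s.path.endswith("/") else s.path.rsplit("/", 1)[0] + "/"
        return u_host == s_host and (u.path or "/").startswith(base)
    raise ValueError(f"unsupported depth value: {depth}")
```

With the seed `https://www.bne.es/`, a URL on another subdomain of bne.es would be in scope for Domain but not for Host, which mirrors the distinction drawn in the field description.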

Build on reliable and scalable technology

FAQ

Frequently Asked Questions

Some basic information about API Store ®.

The operation and development of the APIs are currently fully funded by the company Apitalks, and their use is free of charge.

Yes, you can.

All important information, such as the time of the last update and the license, is included in the response of each API call.

In the case of a major update that is not compatible with the previous version of an API, we keep both versions available for 30 days so that you have enough time to move to the new version. We will inform you about the changes in advance by e-mail.

API Store ®

API Store provides access to European Open Data via a scalable and reliable REST API.