Difference between revisions of "GCat Background"

Revision as of 09:06, 12 July 2016

** THIS DOCUMENT IS A DRAFT **

gCube Data Catalogue.... using CKAN.

CKAN is an open-source data management system (DMS) for powering data hubs and data portals. It makes data accessible by providing tools to streamline publishing, sharing, finding and using data. See: http://ckan.org/

gCube Data Catalogue: Metadata

A metadata record in the gCube Data Catalogue is made of two parts: CKAN's default metadata fields and a gCube Metadata Profile.

CKAN's default metadata fields

These are the metadata fields common to all metadata types in the gCube Data Catalogue (and used by default in the CKAN platform).

Label	Field Name (API)	Definition	Guidelines	Example
Title*	title	Name given to the dataset.	Short phrase, written in plain language. Should be sufficiently descriptive to allow for search and discovery.	Aquaculture Production and Consumption in Africa (2011)
Description	description	Short description explaining the content and its origins.	Description of a few sentences, written in plain language. Should provide a sufficiently comprehensive overview of the resource for anyone to understand its content, origins, and any continuing work on it. The description can be written at the end, since it summarizes key information from the other metadata fields.	This dataset contains attributes of aquaculture production and consumption for each of Africa’s provinces in 2011. The data was provided by………
Tags	tags	An array of taxonomic terms stored as tags	Taxonomic terms	Access to education, Bamboo
License*	license_title	The license that applies to the published dataset.
Organization*	organization	Organization the dataset belongs to	See list of organizations on https://ckan-d-d4s.d4science.org/organization	D4Science
Version	version	Version of dataset	Increase manually after editing	1.0
Author*		Owner of dataset	The person who created the dataset	Joe Bloggs
Author Contact		Contact details of owner	The email or other contact details of the person who created the dataset.	joe@example.com
Maintainer		Maintainer of the dataset	The person who maintains the dataset	Joe Bloggs
Maintainer Contact		Contact details of maintainer	The email or other contact details of the person who maintains the dataset.	joe@example.com

Mandatory fields are marked with an asterisk (*).
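
As a minimal sketch of how the asterisk rule could be checked before publishing, the snippet below looks for blank mandatory fields in a record. The record is keyed by the table's labels, because the table leaves some API field names blank. The function name `missing_mandatory` and the dictionary layout are assumptions made for illustration, not part of the catalogue.

```python
# Labels marked with an asterisk in the table above.
MANDATORY_LABELS = ("Title", "License", "Organization", "Author")


def missing_mandatory(record):
    """Return the mandatory labels that are absent or blank in `record`."""
    return [
        label
        for label in MANDATORY_LABELS
        if not str(record.get(label, "")).strip()
    ]


# Hypothetical record, using the example values from the table.
record = {
    "Title": "Aquaculture Production and Consumption in Africa (2011)",
    "Organization": "D4Science",
    "Author": "Joe Bloggs",
    "Version": "1.0",
}
print(missing_mandatory(record))  # prints ['License']
```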

gCube Metadata Profile

A gCube Metadata Profile defines an XML-based metadata schema for adding custom metadata fields.

A gCube Metadata Profile is composed of one Metadata Format (<metadataformat>) that contains one or more Metadata Fields (<metadatafield>). The schema is the following:

<?xml version="1.0" encoding="UTF-8"?>
<metadataformat>
    <metadatafield>
        <fieldName>Name</fieldName>
        <mandatory>true</mandatory>
        <isBoolean>false</isBoolean>
        <defaulValue>default value</defaulValue>
        <note>shown as suggestions in the insert/update metadata form of CKAN</note>
        <vocabulary>
            <vocabularyField>field1</vocabularyField>
            <vocabularyField>field2</vocabularyField>
            <!-- ... other vocabulary fields -->
        </vocabulary>
        <validator>
            <regularExpression>a regular expression for validating values</regularExpression>
        </validator>
    </metadatafield>
    <!-- ... other metadata fields -->
</metadataformat>
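
A profile of this shape can also be generated programmatically. The following is a sketch using Python's standard `xml.etree.ElementTree`; the helper `build_field` and its parameters are hypothetical, while the element names (including the spelling `defaulValue`) are taken from the schema above.

```python
import xml.etree.ElementTree as ET


def build_field(name, mandatory, vocabulary=(), default=None, note=None):
    """Build one <metadatafield> element, keeping the child order shown above."""
    field = ET.Element("metadatafield")
    ET.SubElement(field, "fieldName").text = name
    ET.SubElement(field, "mandatory").text = "true" if mandatory else "false"
    ET.SubElement(field, "isBoolean").text = "false"
    if default is not None:
        # "defaulValue" is spelled exactly as in the schema above.
        ET.SubElement(field, "defaulValue").text = default
    if note is not None:
        ET.SubElement(field, "note").text = note
    if vocabulary:
        vocab = ET.SubElement(field, "vocabulary")
        for term in vocabulary:
            ET.SubElement(vocab, "vocabularyField").text = term
    return field


profile = ET.Element("metadataformat")
profile.append(
    build_field("ProcessingDegree", True, vocabulary=("Primary", "Secondary"))
)
xml_text = ET.tostring(profile, encoding="unicode")
print(xml_text)
```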

A Metadata Format document can be validated using the following DTD:


<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT metadataformat (metadatafield+)>
<!ELEMENT metadatafield (fieldName, mandatory, isBoolean?, defaulValue?, note?, vocabulary?, validator?)>
<!ELEMENT fieldName (#PCDATA)>
<!ELEMENT mandatory (#PCDATA)>
<!ELEMENT isBoolean (#PCDATA)>  <!-- MUST BE (true|false) -->
<!ELEMENT defaulValue (#PCDATA)> 
<!ELEMENT note (#PCDATA)> 
<!ELEMENT vocabulary (vocabularyField+)> 
<!ELEMENT vocabularyField (#PCDATA)> 
<!ELEMENT validator (regularExpression)> 
<!ELEMENT regularExpression (#PCDATA)>
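
Python's standard library does not include a DTD validator, so the sketch below hand-codes the content model of <metadatafield> from the DTD above and checks a document against it. It is an approximation, not real DTD validation: it does not check the inside of <vocabulary> or <validator>, and the function name `check_profile` is an assumption.

```python
import xml.etree.ElementTree as ET

# Children of <metadatafield> in DTD order; True marks the required ones.
FIELD_CHILDREN = [
    ("fieldName", True),
    ("mandatory", True),
    ("isBoolean", False),
    ("defaulValue", False),
    ("note", False),
    ("vocabulary", False),
    ("validator", False),
]
ORDER = [name for name, _ in FIELD_CHILDREN]


def check_profile(xml_text):
    """Return a list of problems; an empty list means no problem was found."""
    root = ET.fromstring(xml_text)
    if root.tag != "metadataformat":
        return ["root element must be <metadataformat>"]
    if len(root) == 0:
        return ["<metadataformat> needs at least one <metadatafield>"]
    problems = []
    for number, field in enumerate(root, start=1):
        if field.tag != "metadatafield":
            problems.append(f"child {number}: unexpected <{field.tag}>")
            continue
        tags = [child.tag for child in field]
        unknown = [tag for tag in tags if tag not in ORDER]
        if unknown:
            problems.append(f"field {number}: unexpected <{unknown[0]}>")
            continue
        # Each child may appear at most once and must follow the DTD order.
        positions = [ORDER.index(tag) for tag in tags]
        if positions != sorted(set(positions)):
            problems.append(f"field {number}: children repeated or out of order")
        for name, required in FIELD_CHILDREN:
            if required and name not in tags:
                problems.append(f"field {number}: missing <{name}>")
        flag = field.findtext("isBoolean")
        if flag is not None and flag.strip() not in ("true", "false"):
            problems.append(f"field {number}: <isBoolean> must be true or false")
    return problems


SAMPLE = """<metadataformat>
  <metadatafield>
    <fieldName>Accessibility</fieldName>
    <mandatory>true</mandatory>
  </metadatafield>
</metadataformat>"""
print(check_profile(SAMPLE))  # prints []
```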

A possible instance of Metadata Field (<metadatafield>):

<metadatafield>
   <fieldName>Accessibility</fieldName>
   <mandatory>true</mandatory>
   <defaulValue>virtual/public</defaulValue>
   <vocabulary>
       <vocabularyField>virtual/public</vocabularyField>
       <vocabularyField>virtual/private</vocabularyField>
       <vocabularyField>transactional</vocabularyField>
   </vocabulary>
</metadatafield>
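
To illustrate how such a field definition might constrain user input, here is a sketch that checks a value against the instance above. The semantics are assumptions: an empty value falls back to <defaulValue>, a non-empty <vocabulary> restricts the allowed values, and a <regularExpression> must match the whole value. The function name `check_value` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

FIELD_XML = """
<metadatafield>
   <fieldName>Accessibility</fieldName>
   <mandatory>true</mandatory>
   <defaulValue>virtual/public</defaulValue>
   <vocabulary>
       <vocabularyField>virtual/public</vocabularyField>
       <vocabularyField>virtual/private</vocabularyField>
       <vocabularyField>transactional</vocabularyField>
   </vocabulary>
</metadatafield>
"""


def check_value(field, value):
    """Return an error message for `value`, or None if it is acceptable."""
    name = field.findtext("fieldName")
    if not value:
        default = field.findtext("defaulValue")
        if default:
            value = default  # assumption: the default stands in for an empty value
        elif field.findtext("mandatory", "false").strip() == "true":
            return f"{name}: a value is required"
        else:
            return None
    allowed = [item.text for item in field.findall("vocabulary/vocabularyField")]
    if allowed and value not in allowed:
        return f"{name}: {value!r} is not in the vocabulary"
    pattern = field.findtext("validator/regularExpression")
    if pattern and not re.fullmatch(pattern, value):
        return f"{name}: {value!r} does not match {pattern!r}"
    return None


field = ET.fromstring(FIELD_XML)
print(check_value(field, "transactional"))  # prints None
print(check_value(field, "public"))
```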

SoBigData.eu: Dataset Metadata

The current list of fields characterising a SoBigData resource is available at https://docs.google.com/spreadsheets/d/1kuhvmDVKpmqt2foyCB9wDo3HgzoAiCuRQ8CjRS-DVOM/edit?usp=sharing

The following fields have been identified:

Field	In Catalogue
Internal Fields
Internal Identifier	Automatically created
Creation Date	Automatically created
Last Modification	Automatically updated
General Description
Title	Title
Identifier	<fieldName>External Identifier</fieldName> <mandatory>false</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>This applies only to datasets that have been already published. Insert here a DOI, a handle, or any other identifier assigned when publishing the dataset elsewhere.</note> <vocabulary></vocabulary> <validator></validator>
Creators	Author is there, unfortunately there is only one author per Dataset. Moreover, the technology supports only key value pairs ... no complex types. <fieldName>Creator</fieldName> <mandatory>true</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>The name of the creator, with email and ORCID. The format should be: family, given[, email][, ORCID]. Examples: Smith, John, js@acme.org, orcid.org/0000-0000-0000-0000; Miller, Elizabeth </note> <vocabulary></vocabulary> <validator></validator>
Creation Date	<fieldName>CreationDate</fieldName> <mandatory>true</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>The date of creation of the dataset (different from the date of creation of the dataset automatically added by the system) </note> <vocabulary></vocabulary> <validator></validator>
Distributor	Maintainer
Publisher	???
Publication Date	when the dataset is published in the repository ... no field has to be specified;
Contact	Go for Maintainer? I would go for Maintainer email
Thematic Cluster	Shall we go for a Topic too? I think so. <fieldName>ThematicCluster</fieldName> <mandatory>true</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>The SoBigData.eu Thematic Clusters </note> <vocabulary> <vocabularyField>Text and Social Media Mining</vocabularyField> <vocabularyField>Social Network Analysis</vocabularyField> <vocabularyField>Human Mobility Analytics</vocabularyField> <vocabularyField>Web Analytics</vocabularyField> <vocabularyField>Visual Analytics</vocabularyField> <vocabularyField>Social Data</vocabularyField> </vocabulary> <validator></validator>
Area	Tag vs domain specific field
Semantic Coverage	Tag vs domain specific field
Time Coverage Start Date	<fieldName>TimeCoverage</fieldName> <mandatory>true</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>List of time intervals, e.g. 1977-03-10T11:45:30 - 2005-01-15T09:10:00</note> <vocabulary></vocabulary> <validator></validator>
Time Coverage End Date	not needed see above
Geo Location	<fieldName>spatial</fieldName> <mandatory>false</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>The value must be a valid GeoJSON geometry, for example: { "type":"Polygon", "coordinates":[[[2.05827, 49.8625],[2.05827, 55.7447], [-6.41736, 55.7447], [-6.41736, 49.8625], [2.05827, 49.8625]]] } or: { "type": "Point", "coordinates": [-3.145,53.078] } </note> <vocabulary></vocabulary> <validator></validator> More on GeoJSON geometry.
ProcessingDegree	Shall we go for a Topic too? I think so. <fieldName>ProcessingDegree</fieldName> <mandatory>true</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>Whether primary or secondary dataset. </note> <vocabulary> <vocabularyField>Primary</vocabularyField> <vocabularyField>Secondary</vocabularyField> </vocabulary> <validator></validator>
ManifestationType	Shall we go for a Topic too? I think so. <fieldName>ManifestationType</fieldName> <mandatory>true</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>Virtual (accessible in streaming from remote sites), replica (copy of data in remote sites, e.g. DBPL), original (collection of data produced and kept in local infra by data provider). </note> <vocabulary> <vocabularyField>Virtual</vocabularyField> <vocabularyField>Replica</vocabularyField> <vocabularyField>Original</vocabularyField> </vocabulary> <validator></validator>
Language	Shall we go for a Topic too? I think so. <fieldName>Language</fieldName> <mandatory>false</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>The primary language of the resource (use ISO 639-1). </note> <vocabulary></vocabulary> <validator></validator>
Description	Description
RelatedLiterature	<fieldName>RelatedPaper</fieldName> <mandatory>false</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>Insert a complete reference to an associated work. </note> <vocabulary></vocabulary> <validator></validator>
RelatedDataset	TBD
Accessibility properties
Accessibility	<fieldName>Accessibility</fieldName> <mandatory>true</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>How the access to the resource is regulated: VA (Virtual Access) or TNA (Trans National Access), Public vs Restricted. </note> <vocabulary> <vocabularyField>VA/Public</vocabularyField> <vocabularyField>VA/Restricted</vocabularyField> <vocabularyField>TNA/Restricted</vocabularyField> </vocabulary> <validator></validator>
AccessibilityMode	<fieldName>AccessibilityMode</fieldName> <mandatory>true</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>How the access to the resource is offered. </note> <vocabulary> <vocabularyField>Programmatic (e.g. API)</vocabularyField> <vocabularyField>By file</vocabularyField> <vocabularyField>...</vocabularyField> </vocabulary> <validator></validator>
Privacy	TBC
Technical properties
Size
DiskSize
Format
FormatSchema
API
Legal and Ethical Aspects
Personal Data	<fieldName>PersonalData</fieldName> <mandatory>true</mandatory> <isBoolean>true</isBoolean> <defaulValue></defaulValue> <note>The dataset contains personal data?</note> <vocabulary> </vocabulary> <validator></validator>
Personal sensitive data	<fieldName>PersonalSensitiveData</fieldName> <mandatory>false</mandatory> <isBoolean>true</isBoolean> <defaulValue></defaulValue> <note>The dataset contains personal sensitive data?</note> <vocabulary> </vocabulary> <validator></validator>
Data set contains data of children	<fieldName>ChildrenData</fieldName> <mandatory>true</mandatory> <isBoolean>true</isBoolean> <defaulValue></defaulValue> <note>The dataset contains children data?</note> <vocabulary> </vocabulary> <validator></validator>
Consent of the data subject	TBD
Consent obtained also covers the envisaged transfer of the personal data outside the EU	TBD
Personal data was manifestly made public by the data subject	TBD
Data Protection Directive	<fieldName>DataProtectionDirective</fieldName> <mandatory>true</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>Report the law or protocol number and the institution related to Data Protection.</note> <vocabulary> </vocabulary> <validator></validator>
Intellectual properties
IP/Copyrights
Link to the source	Resource
License	License
Link to the license	Automatic
Field/Scope of use
Basic rights
Restrictions on use
Prohibited actions
Sublicense rights
Attribution requirements
Display requirements
Distribution requirements
Territory of use
License term
Requirement of non-disclosure (confidentiality mark)
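
Because the technology supports only key/value pairs, the Creator field in the table above packs several creators into one string in the form family, given[, email][, ORCID]. The sketch below shows one way such a value could be split back into structured records. It assumes that creators are separated by semicolons, as in the note's example, and the function name `parse_creators` is hypothetical.

```python
def parse_creators(value):
    """Split a Creator value into one dictionary per creator."""
    creators = []
    for entry in value.split(";"):
        if not entry.strip():
            continue  # tolerate a trailing semicolon
        parts = [part.strip() for part in entry.split(",") if part.strip()]
        if len(parts) < 2:
            raise ValueError(f"expected 'family, given' in {entry.strip()!r}")
        creator = {
            "family": parts[0],
            "given": parts[1],
            "email": None,
            "orcid": None,
        }
        for extra in parts[2:]:
            if "@" in extra:
                creator["email"] = extra
            elif extra.startswith("orcid.org/"):
                creator["orcid"] = extra
            else:
                raise ValueError(f"unrecognised part {extra!r}")
        creators.append(creator)
    return creators


example = "Smith, John, js@acme.org, orcid.org/0000-0000-0000-0000; Miller, Elizabeth"
for creator in parse_creators(example):
    print(creator)
```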

SoBigData.eu: Method Metadata

The current list of fields characterising a SoBigData resource is available at https://docs.google.com/spreadsheets/d/1kuhvmDVKpmqt2foyCB9wDo3HgzoAiCuRQ8CjRS-DVOM/edit?usp=sharing

The following fields have been identified:

Field	In Catalogue
Internal Fields
Internal Identifier	Automatically created
Creation Date	Automatically created
Last Modification	Automatically updated
General Description
Title	Title
Identifier	<fieldName>External Identifier</fieldName> <mandatory>false</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>This applies only to methods that have been already published. Insert here a DOI, a handle, or any other identifier assigned when publishing the method elsewhere.</note> <vocabulary></vocabulary> <validator></validator>
Creators	Author is there, unfortunately there is only one author per item. Moreover, the technology supports only key value pairs ... no complex types. <fieldName>Creator</fieldName> <mandatory>true</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>The name of the creator, with email and ORCID. The format should be: family, given[, email][, ORCID]. Examples: Smith, John, js@acme.org, orcid.org/0000-0000-0000-0000; Miller, Elizabeth </note> <vocabulary></vocabulary> <validator></validator>
Creation Date	<fieldName>CreationDate</fieldName> <mandatory>true</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>The date of creation of the method (different from the date of creation of the dataset automatically added by the system) </note> <vocabulary></vocabulary> <validator></validator>
Distributor	Maintainer
Owner	???
Publication Date	when the method is published in the catalogue ... no field has to be specified;
Contact	Go for Maintainer? I would go for Maintainer email
Thematic Cluster	Shall we go for a Topic too? I think so. <fieldName>ThematicCluster</fieldName> <mandatory>true</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>The SoBigData.eu Thematic Clusters </note> <vocabulary> <vocabularyField>Text and Social Media Mining</vocabularyField> <vocabularyField>Social Network Analysis</vocabularyField> <vocabularyField>Human Mobility Analytics</vocabularyField> <vocabularyField>Web Analytics</vocabularyField> <vocabularyField>Visual Analytics</vocabularyField> <vocabularyField>Social Data</vocabularyField> </vocabulary> <validator></validator>
Area	Tag vs domain specific field
Semantic Coverage	Tag vs domain specific field
Usage mode	<fieldName>UsageMode</fieldName> <mandatory>true</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>How the method is expected to be accessed </note> <vocabulary> <vocabularyField>Download</vocabularyField> <vocabularyField>as-a-Service by SoBigData Infrastructure</vocabularyField> <vocabularyField>as-a-Service by third party infrastructure</vocabularyField> </vocabulary> <validator></validator>
methodURL	Resource
documentationURL	Resource
inputParametersType	<fieldName>input</fieldName> <mandatory>true</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>See WPS </note> <vocabulary> </vocabulary> <validator></validator>
outputType	<fieldName>output</fieldName> <mandatory>true</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>See WPS </note> <vocabulary> </vocabulary> <validator></validator>
Description	Description
RelatedLiterature	<fieldName>RelatedPaper</fieldName> <mandatory>false</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>Insert a complete reference to an associated work. </note> <vocabulary></vocabulary> <validator></validator>
RelatedDataset	TBD
RelatedMethod	TBD
Accessibility properties
Accessibility	<fieldName>Accessibility</fieldName> <mandatory>true</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>How the access to the resource is regulated: VA (Virtual Access) or TNA (Trans National Access), Public vs Restricted. </note> <vocabulary> <vocabularyField>VA/Public</vocabularyField> <vocabularyField>VA/Restricted</vocabularyField> <vocabularyField>TNA/Restricted</vocabularyField> </vocabulary> <validator></validator>
AccessibilityMode	<fieldName>AccessibilityMode</fieldName> <mandatory>true</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>How the access to the resource is offered. </note> <vocabulary> <vocabularyField>Programmatic (e.g. API)</vocabularyField> <vocabularyField>By file</vocabularyField> <vocabularyField>...</vocabularyField> </vocabulary> <validator></validator>
Technical properties
Programming Language	<fieldName>ProgrammingLanguage</fieldName> <mandatory>false</mandatory> <isBoolean>false</isBoolean> <defaulValue></defaulValue> <note>The primary language used to implement the method. </note> <vocabulary></vocabulary> <validator></validator>
Hosting Environment
Source code
Artifact repository
Dependencies on Other SW
Intellectual properties
IP/Copyrights
License	License
Link to the license	Automatic
Field/Scope of use
Basic rights
Restrictions on use
Prohibited actions
Sublicense rights
Attribution requirements
Display requirements
Distribution requirements
Territory of use
License term
Requirement of non-disclosure (confidentiality mark)

gCube Data Catalogue: Ckan Connector

gCube Data Catalogue: Geo Harvesting

This extension contains plugins like ckanext-geonetwork (and others) which add geospatial capabilities to CKAN.

Several harvesters that import geospatial metadata into CKAN from other sources, in ISO 19139 and other formats, have been created in the gCube Data Catalogue. In particular, all metadata created in gCube Geonetwork (GeoNetwork is the catalogue application that manages the spatially referenced resources generated in the D4Science Infrastructure) are harvested through the 'Geonetwork Resolver', a "middle tier" able to:

use the Geonetwork Manager to harvest private metadata (via authentication) stored in gCube Geonetwork into the CKAN Data Catalogue (e.g. http://data-d.d4science.org/geonetwork/gcube_devsec_devVRE to harvest private metadata generated from scope /gcube/devsec/devVRE);

create a CKAN Harvester that, via configuration, skips all public metadata during scope harvesting (e.g. http://data-d.d4science.org/geonetwork/gcube_devsec_devVRE%23filterpublicids to filter public ids during harvesting of /gcube/devsec/devVRE);

create a CKAN Harvester that, via configuration, harvests only public metadata (saved on Geonetwork) and so avoids the Geonetwork authentication (e.g. http://data-d.d4science.org/geonetwork/gcube_devsec_devVRE%23noauthentication).
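
The three example URLs above follow a visible pattern: the scope's slashes become underscores, and an optional configuration keyword is appended after an encoded "#" (%23). The sketch below reproduces that pattern. It is inferred from the examples only, and the function name `harvest_url` is hypothetical.

```python
from urllib.parse import quote

# Base URL taken from the examples above.
RESOLVER_BASE = "http://data-d.d4science.org/geonetwork/"


def harvest_url(scope, option=None):
    """Build a resolver URL for a gCube scope such as /gcube/devsec/devVRE."""
    scope_id = scope.strip("/").replace("/", "_")
    url = RESOLVER_BASE + scope_id
    if option:
        # "#" is percent-encoded as %23, as in the examples above.
        url += quote("#") + option
    return url


print(harvest_url("/gcube/devsec/devVRE"))
print(harvest_url("/gcube/devsec/devVRE", "filterpublicids"))
print(harvest_url("/gcube/devsec/devVRE", "noauthentication"))
```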

The mapping of fields from an ISO19139 Metadata to a Ckan Dataset via ckanext-geonetwork is shown in the following table:

ISO19139	Ckan Dataset
Title	Title
Description	Description

Digital Transfer Option	Data and Resource
CI_OnlineResource
gmd:url	URL
gmd:name	Name
gmd:description	Description

Descriptive Keywords
gmd:keyword	Tag
	Additional Info
bbox, metadata language, age, reference system, etc.	key/value

gCube Data Catalogue: Geo Datasets

In order to make a dataset queryable by location (geospatial dataset), a special extra must be defined, with its key named ‘spatial’. The value must be a valid GeoJSON geometry, for example:

{
  "type":"Polygon",
  "coordinates":[[[2.05827, 49.8625],[2.05827, 55.7447], [-6.41736, 55.7447], [-6.41736, 49.8625], [2.05827, 49.8625]]]
}

[Note: the polygon must be closed]

or

{
  "type": "Point",
  "coordinates": [-3.145,53.078]
}

The GeoJSON Format Specification is available here: http://geojson.org/geojson-spec.html

Datasets with spatial values are automatically geo-indexed, so that they can be searched using spatial filters.
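
A small sanity check before saving the extra can catch the most common mistake, an unclosed polygon ring. The sketch below checks Point and Polygon geometries only and returns the extra as a key/value pair with the geometry serialised to a JSON string. The function name `spatial_extra` and the exact shape of the returned dictionary are assumptions.

```python
import json


def spatial_extra(geometry):
    """Return a 'spatial' key/value extra after basic GeoJSON checks."""
    kind = geometry.get("type")
    coords = geometry.get("coordinates")
    if kind == "Point":
        if not coords or len(coords) < 2:
            raise ValueError("a point needs [longitude, latitude]")
    elif kind == "Polygon":
        if not coords:
            raise ValueError("a polygon needs at least one ring")
        for ring in coords:
            # A closed ring repeats its first position as its last one.
            if len(ring) < 4 or ring[0] != ring[-1]:
                raise ValueError("each polygon ring must be closed")
    else:
        raise ValueError(f"unsupported geometry type: {kind!r}")
    return {"key": "spatial", "value": json.dumps(geometry)}


polygon = {
    "type": "Polygon",
    "coordinates": [[[2.05827, 49.8625], [2.05827, 55.7447],
                     [-6.41736, 55.7447], [-6.41736, 49.8625],
                     [2.05827, 49.8625]]],
}
print(spatial_extra(polygon)["key"])  # prints spatial
```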

GeoSpatial search for datasets: via API or Search Widget

Once your datasets are geo-indexed, you can perform spatial queries by bounding box (coordinates format is [LONG, LAT]), via the following API call:

/api/2/search/dataset/geo?bbox={minx,miny,maxx,maxy}[&crs={srid}]

If the bounding box coordinates are not in the same projection as the one defined in the database, a CRS must be provided, in one of the following forms:

    urn:ogc:def:crs:EPSG::4326
    EPSG:4326
    4326

Otherwise the default CRS is EPSG:4326. See the CKAN Wiki page for Legacy API.

Moreover, you can perform spatial queries using an integrated map widget in CKAN, which allows filtering results by an area of interest. You can try it on the D4Science Data Catalogue.
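
As an illustration, the request path for the call above can be assembled as follows. This is a sketch: it only builds the path and query string, leaves the catalogue host out, and the function name `geo_search_path` is hypothetical.

```python
from urllib.parse import urlencode


def geo_search_path(minx, miny, maxx, maxy, crs=None):
    """Build the path and query string for a bounding-box search."""
    params = {"bbox": f"{minx},{miny},{maxx},{maxy}"}
    if crs is not None:
        params["crs"] = crs
    # Keep commas and colons readable instead of percent-encoding them.
    return "/api/2/search/dataset/geo?" + urlencode(params, safe=",:")


# Bounding box of the example polygon, as minx, miny, maxx, maxy.
print(geo_search_path(-6.41736, 49.8625, 2.05827, 55.7447))
print(geo_search_path(-6.41736, 49.8625, 2.05827, 55.7447, crs="EPSG:4326"))
```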

CKAN Wiki page for Spatial Search Widget

@@ Line 796: / Line 796: @@
 == gCube Data Catalogue: Geo Harvesting ==
-This extension contains plugins like [https://github.com/geosolutions-it/ckanext-geonetwork/wiki ckanext-geonetwork] (and other plugins) which add geospatial capabilities to CKAN.
+This extension contains plugins like [https://github.com/geosolutions-it/ckanext-geonetwork/wiki ckanext-geonetwork] (and others) which add geospatial capabilities to CKAN.
 Several harvesters to import geospatial metadata into CKAN from other sources in ISO 19139 format and others has been created in gCube Data Catalogue.
