Metadata Broker

From Gcube Wiki
Revision as of 13:16, 29 May 2007 by Sboutsis (Talk | contribs) (Transforming a single record using a XSLT)

Introduction

The main function of the Metadata Broker is to convert XML documents from one schema and/or language to another. The inputs and outputs of a transformation can be single records, ResultSets or entire collections. In the special case where both the inputs and the output are collections, a persistent transformation is possible: whenever the input collection(s) change, the new data is automatically transformed so that the change is reflected in the output collection.

Transformation Programs

Complex transformation processes are described by transformation programs, which are XML documents. Transformation programs are stored in the DIS. Each transformation program can reference other transformation programs and use them as “black-box” components in the transformation process it defines.

Each transformation program consists of:

  • One or more input definitions. Each one may be a:
    • Data input: accepts a reference to a ResultSet, Collection or Record to be transformed. It also contains the schema-id and language-id that the input data should have.
    • Input variable: a placeholder for an additional string value which must be passed to the transformation program at run-time.
  • Exactly one output definition, which contains the output data type (Record, ResultSet or collection), schema and language.
  • One or more transformation rule definitions.

Note: The name of the input or output schema must be given in the format NAME=URI, where NAME is the name of the schema and URI is the URI of its definition, e.g. DC=http://dublincore.org/schemas/xmls/simpledc20021212.xsd.
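
The NAME=URI convention can be illustrated with a small parsing helper. This is only a sketch: the SchemaRef class is hypothetical and is not part of the metadata broker library, and splitting on the first '=' is an assumption made so that URIs containing '=' are left intact.

```java
// Hypothetical helper for illustration only; not part of the metadata broker library.
final class SchemaRef {
    final String name;
    final String uri;

    private SchemaRef(String name, String uri) {
        this.name = name;
        this.uri = uri;
    }

    // Splits on the first '=' only (an assumption), so a URI that itself
    // contains '=' characters is kept whole.
    static SchemaRef parse(String value) {
        int eq = value.indexOf('=');
        if (eq <= 0 || eq == value.length() - 1) {
            throw new IllegalArgumentException("Expected NAME=URI but got: " + value);
        }
        return new SchemaRef(value.substring(0, eq).trim(), value.substring(eq + 1).trim());
    }

    public static void main(String[] args) {
        SchemaRef dc = SchemaRef.parse("DC=http://dublincore.org/schemas/xmls/simpledc20021212.xsd");
        System.out.println(dc.name + " -> " + dc.uri);
    }
}
```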

Transformation Rules

Transformation rules are the building blocks of transformation programs. Each transformation program contains at least one transformation rule. Transformation rules describe simple transformations and execute in the order in which they are defined inside the transformation program. Usually the output of a transformation rule is the input of the next one, so a transformation program can be thought of as a chain of transformation rules that together perform the complex transformation defined by the whole program.

Each transformation rule consists of:

  • One or more input definitions. Each definition contains the schema, language, type (record, ResultSet, collection or variable) and data reference of the input it describes. Each of these elements (except for the 'type' element) can be either a literal value or a reference to another value defined inside the transformation program (using XPath syntax).
  • Exactly one output, which can be:
    • A definition that contains the output data type (Record, ResultSet or collection), schema and language.
    • A reference to the transformation program's output (using XPath syntax). This is the way to express that the output of this transformation rule will also be the output of the whole transformation program, so such a reference is only valid for the transformation program's final rule.
  • The name of the underlying program to execute in order to do the transformation.

Note: The name of the input or output schema must be given in the format NAME=URI, where NAME is the name of the schema and URI is the URI of its definition, e.g. DC=http://dublincore.org/schemas/xmls/simpledc20021212.xsd.
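
How an XPath reference inside a rule points back to a value defined elsewhere in the transformation program can be sketched with the standard javax.xml.xpath API. The ReferenceResolver class below is hypothetical, and the page does not describe how the broker resolves references internally, so this only illustrates the idea. The sample document assumes that the caller-supplied value has already been filled into the program-level Schema element.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical illustration; not the broker's actual resolution code.
final class ReferenceResolver {

    // Evaluates an XPath expression against the program XML and returns
    // the trimmed string value of the first matching node.
    static String resolve(String programXml, String xpathExpression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(xpathExpression.trim(),
                    new InputSource(new StringReader(programXml))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid reference: " + xpathExpression, e);
        }
    }

    // Reduced sample program: the program-level input already carries a value,
    // and the rule-level input refers to it through an XPath expression.
    static final String SAMPLE =
          "<TransformationProgram>"
        + "<Input name=\"TPInput\">"
        + "<Schema isVariable=\"true\">DC=http://dublincore.org/schemas/xmls/simpledc20021212.xsd</Schema>"
        + "</Input>"
        + "<TransformationRule><Definition>"
        + "<Input name=\"Rule1Input1\">"
        + "<Schema isVariable=\"true\"> //Input[@name='TPInput']/Schema </Schema>"
        + "</Input>"
        + "</Definition></TransformationRule>"
        + "</TransformationProgram>";

    static String demo() {
        // Step 1: read the reference text stored in the rule's input.
        String reference = resolve(SAMPLE, "//Input[@name='Rule1Input1']/Schema");
        // Step 2: follow the reference to the value it points at.
        return resolve(SAMPLE, reference);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```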

Programs

A program (not to be confused with a transformation program) is the Java class which performs the actual transformation on the input data. A transformation rule is just an XML description of the interface (inputs and output) of a program. A program must implement the Program Java interface:

interface Program {
  public String getOutput();
}

getOutput() returns the output of the transformation program as a string. If the output is a record, the return value should be the transformed record. If the output is a ResultSet, the return value should be the ResultSet EPR. Finally, if the output is a collection, the return value should be the collection id.

The Program interface does not define any transformation methods. A program can define any number of methods. When a transformation rule that references the program is executed, the metadata broker service uses reflection to locate the method to call, based on the input and output types defined in that rule. For the broker to be able to locate and use a transformation method, each of its parameters must have one of the following data types:

  • RecordType: A data type that holds the schema, language and payload of a full record.
  • ResultSetType: A data type that holds the schema, language and EPR of a ResultSet.
  • CollectionType: A data type that holds the schema, language and id of a collection.
  • VariableType: A data type that holds the string value of a variable defined inside a transformation program.

The definitions of these data types are contained in the metadata broker library.

When a transformation method of a program is called as the result of the execution of a transformation rule with N inputs and one output, the following convention is used:

  • The first N parameters passed to the method are objects holding information about the input data.
  • The last parameter is an object holding information about the output data.

The type of each parameter should be one of the four types mentioned above (RecordType, ResultSetType, CollectionType, VariableType).
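
The lookup convention described above (N input parameters followed by one output parameter, matched by type) can be sketched with java.lang.reflect. Everything in this block is a simplified stand-in: the nested RecordType, VariableType and Program types only mimic the library types of the same name, PrefixProgram is a made-up program, and the matching logic is an assumption about how such a lookup could work, not the broker's actual code.

```java
import java.lang.reflect.Method;
import java.util.Arrays;

final class ReflectiveDispatchSketch {

    // Simplified stand-ins for the library data types (assumed shapes).
    static class RecordType {
        final String payload;
        RecordType(String payload) { this.payload = payload; }
    }

    static class VariableType {
        final String value;
        VariableType(String value) { this.value = value; }
    }

    interface Program {
        String getOutput();
    }

    // Made-up program: prepends the variable's value to the record payload.
    public static class PrefixProgram implements Program {
        private String output;

        public void transform(RecordType in, VariableType prefix, RecordType out) {
            output = prefix.value + in.payload;
        }

        public String getOutput() { return output; }
    }

    // Finds a public method whose parameter types are exactly the given
    // input types followed by the output type.
    static Method findTransformationMethod(Class<?> programClass, Class<?>... parameterTypes) {
        for (Method m : programClass.getMethods()) {
            if (Arrays.equals(m.getParameterTypes(), parameterTypes)) {
                return m;
            }
        }
        throw new IllegalArgumentException("No method of " + programClass.getName()
                + " accepts " + Arrays.toString(parameterTypes));
    }

    // Loads the program by class name, calls the matching method and returns its output.
    static String run(String programClassName, Object... inputsThenOutput) {
        try {
            Class<?> cls = Class.forName(programClassName);
            Program program = (Program) cls.getDeclaredConstructor().newInstance();
            Class<?>[] types = new Class<?>[inputsThenOutput.length];
            for (int i = 0; i < types.length; i++) {
                types[i] = inputsThenOutput[i].getClass();
            }
            findTransformationMethod(cls, types).invoke(program, inputsThenOutput);
            return program.getOutput();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not execute " + programClassName, e);
        }
    }

    static String demo() {
        return run(PrefixProgram.class.getName(),
                new RecordType("<r/>"), new VariableType("x:"), new RecordType(""));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Matching on exact parameter types keeps the sketch short. A lookup that also accepted subclasses would need Class.isAssignableFrom instead of Arrays.equals.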

Implementation Overview

The metadata broker consists of two components:

  • The metadata broker service
    The metadata broker service provides the functionality of the metadata broker in the form of a stateless service. In the case of a persistent transformation, the service creates a WS-Resource holding information about this transformation and registers for notifications concerning changes in the input collection(s). The created resources are not published and remain completely invisible to the caller.

    The service exposes the following operations:
    • transform(TransformationProgramID, params) -> String
      This operation takes the DiligentID of a transformation program stored in the DIS and a set of transformation parameters. The referenced transformation program is executed using the provided parameters, which are just a set of value assignments to variables defined inside the transformation program. The metadata broker library contains a helper class for creating such a parameter set.
    • transformWithNewTP(TransformationProgram, params) -> String
      This operation offers the same functionality as the previous one. However, in this case the first parameter is the full XML definition of a transformation program in string format and not the DiligentID of a stored one.
    • findPossibleTransformationPrograms (InputDesc, OutputDesc) -> TransformationProgram[]
      This operation takes the description of an input format (type, language and schema) and the description of a desired output format, and returns an array of transformation program definitions that could be used to perform the required conversion. These transformation programs may not exist before the operation is invoked. They are produced on the fly, by combining existing transformation programs that are compatible with each other in order to synthesize more complex ones. If an existing transformation program is already applicable to the requested type of transformation, it is included in the results.

  • The metadata broker library
    The metadata broker library contains the definitions of the RecordType, CollectionType, ResultSetType and VariableType Java classes, as well as the definition of the Program Java interface. The following programs are also included in it:
    • Generic XSLT record-to-record transformer (GXSLT_Rec2Rec): transforms a given record using a given XSLT definition. The output is the transformed record.
    • Generic XSLT ResultSet-to-ResultSet transformer (GXSLT_RS2RS): transforms a given ResultSet using a given XSLT definition, producing a new ResultSet. The output is the new ResultSet's EPR.
    • Generic XSLT Collection-to-Collection transformer (GXSLT_Col2Col): transforms a given collection using a given XSLT, producing a new collection. The output is the new collection id.
    • Generic XSLT ResultSet-to-Collection transformer (GXSLT_RS2Col): transforms the records of a given ResultSet using a given XSLT, and adds them to a new collection with caller-defined attributes. The output is the new collection id.
    • Generic XSLT Collection-to-ResultSet transformer (GXSLT_Col2RS): transforms each record of a given collection using a given XSLT and creates a new ResultSet containing the transformed records. The output is the new ResultSet's EPR.

The transformation of metadata using any of the above programs, except for the GXSLT_Rec2Rec program, is a non-blocking operation. The caller does not block until the transformation is completed, since transforming a big ResultSet or collection may be quite time-consuming. To achieve this, each program first prepares the output data to be returned to the caller (either the endpoint reference of the output ResultSet or the ID of the output collection, depending on the output data type of the transformation) and then spawns a new thread to perform the transformation process.
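
The non-blocking pattern described above can be sketched as follows. This is only an illustration under stated assumptions: the in-memory COLLECTIONS map stands in for real collection storage, the Started holder exposes the worker thread purely so that the demo can wait for it, and none of these names come from the metadata broker library.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.UnaryOperator;

final class NonBlockingTransformSketch {

    // In-memory stand-in for collection storage (assumption for the sketch).
    static final Map<String, List<String>> COLLECTIONS = new ConcurrentHashMap<>();

    // Returned immediately to the caller; the worker is exposed only for the demo.
    static final class Started {
        final String outputCollectionId;
        final Thread worker;

        Started(String outputCollectionId, Thread worker) {
            this.outputCollectionId = outputCollectionId;
            this.worker = worker;
        }
    }

    static Started start(List<String> inputRecords, UnaryOperator<String> recordTransformer) {
        // 1. Prepare the output reference before any record is transformed.
        String outputId = UUID.randomUUID().toString();
        List<String> output = new CopyOnWriteArrayList<>();
        COLLECTIONS.put(outputId, output);

        // 2. Transform the records in a separate thread.
        Thread worker = new Thread(() -> {
            for (String record : inputRecords) {
                output.add(recordTransformer.apply(record));
            }
        });
        worker.start();

        // 3. Return without waiting for the worker to finish.
        return new Started(outputId, worker);
    }

    static List<String> demo() {
        Started started = start(Arrays.asList("a", "b"), String::toUpperCase);
        try {
            started.worker.join(); // the demo waits; a real caller would not have to
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return COLLECTIONS.get(started.outputCollectionId);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```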

Internally, some programs depend on others, meaning that they use other programs in order to avoid code duplication. For instance, the GXSLT_Rec2Rec program is used by every other program, because the transformation of any complex type of data input (such as ResultSets or collections) ultimately comes down to transforming single records one by one. The XSLTs are always compiled before bulk transformations are performed, which makes the whole process faster.
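
Compiling a stylesheet once and reusing it for many records can be sketched with the standard JAXP Templates API, which is also what the compileXSLT method of GXSLT_Rec2Rec returns. The CompiledXsltSketch class, the sample stylesheet and the sample records below are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class CompiledXsltSketch {

    // Made-up stylesheet: wraps the title of each record in an <item> element.
    static final String XSLT =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"/record\">"
        + "<item><xsl:value-of select=\"title\"/></item>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static List<String> transformAll(String xslt, List<String> records) {
        try {
            // Compile the stylesheet once...
            Templates compiled = TransformerFactory.newInstance()
                    .newTemplates(new StreamSource(new StringReader(xslt)));
            List<String> results = new ArrayList<>();
            for (String record : records) {
                // ...and create a cheap Transformer from it for every record.
                Transformer t = compiled.newTransformer();
                t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
                StringWriter out = new StringWriter();
                t.transform(new StreamSource(new StringReader(record)), new StreamResult(out));
                results.add(out.toString());
            }
            return results;
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "<record><title>A</title></record>",
                "<record><title>B</title></record>");
        for (String result : transformAll(XSLT, records)) {
            System.out.println(result);
        }
    }
}
```

A Templates object is safe to share between threads, whereas each Transformer should be used by one thread at a time, which is why a new Transformer is created per record.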

Each program is placed in a Java package of its own, beginning with 'org.diligentproject.metadatamanagement.metadatabrokerlibrary.programs'. However, this is just a convention followed for the default programs contained in the metadata broker library. There is no restriction on the package names of user-defined programs. In order for user-defined programs to be accessible by the Metadata Broker, they should be packaged in JAR files and copied to the 'lib' directory under the installation directory of ws-core (or to any directory listed in the CLASSPATH environment variable).
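
Returning to the findPossibleTransformationPrograms operation of the service: the idea of combining compatible transformation programs into a longer one can be sketched as a search over formats. This is a deliberately reduced illustration. The Step class, the format strings and the program ids are made up, and the sketch returns only one shortest chain, whereas the operation is described as returning every applicable transformation program. How the service actually performs the synthesis is not documented here.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ChainSynthesisSketch {

    // Made-up description of an existing transformation program:
    // it converts data in format 'from' into format 'to'.
    static final class Step {
        final String id;
        final String from;
        final String to;

        Step(String id, String from, String to) {
            this.id = id;
            this.from = from;
            this.to = to;
        }
    }

    // Breadth-first search from the input format to the output format.
    // Returns the ids of the programs to chain, in execution order, or an
    // empty list if no chain exists (or if from and to are already equal).
    static List<String> shortestChain(List<Step> available, String from, String to) {
        Map<String, Step> reachedBy = new HashMap<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(from);
        queue.add(from);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(to)) {
                break;
            }
            for (Step step : available) {
                // A step is compatible when its input format is the current format.
                if (step.from.equals(current) && seen.add(step.to)) {
                    reachedBy.put(step.to, step);
                    queue.add(step.to);
                }
            }
        }
        if (!seen.contains(to)) {
            return Collections.emptyList();
        }
        LinkedList<String> chain = new LinkedList<>();
        for (String format = to; !format.equals(from); format = reachedBy.get(format).from) {
            chain.addFirst(reachedBy.get(format).id);
        }
        return chain;
    }

    static List<Step> sampleSteps() {
        return Arrays.asList(
                new Step("TP-1", "record|SchemaA|en", "record|SchemaB|en"),
                new Step("TP-2", "record|SchemaB|en", "record|SchemaC|en"),
                new Step("TP-3", "record|SchemaA|en", "record|SchemaD|en"));
    }

    public static void main(String[] args) {
        System.out.println(shortestChain(sampleSteps(), "record|SchemaA|en", "record|SchemaC|en"));
    }
}
```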

Dependencies

  • MetadataBrokerService
    • jdk 1.5
    • WS-Core
    • MetadataBrokerLibrary
    • DISHLSClient
  • MetadataBrokerLibrary
    • jdk 1.5
    • WS-Core
    • ResultSet bundle
    • DISHLSClient
    • Metadata catalog service stubs
    • Metadata catalog library

Usage Example

The following example uses the transform operation of the metadata broker service.

Transforming a single record using an XSLT

This is the GXSLT_Rec2Rec class (included in the metadata broker library), which performs the actual conversion:

package org.diligentproject.metadatamanagement.metadatabrokerlibrary.programs.GXSLT_Rec2Rec;

import org.apache.log4j.Logger;
import org.diligentproject.metadatamanagement.metadatabrokerlibrary.programs.Program;
import org.diligentproject.metadatamanagement.metadatabrokerlibrary.programs.RecordType;
import org.diligentproject.metadatamanagement.metadatabrokerlibrary.programs.VariableType;

import java.io.StringReader;
import java.io.StringWriter;
import java.rmi.RemoteException;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class GXSLT_Rec2Rec implements Program {
	
	private static Logger log = Logger.getLogger(GXSLT_Rec2Rec.class);
	private StringWriter output;
	private static TransformerFactory factory = TransformerFactory.newInstance();
		
	public void transform(RecordType record, VariableType xslt, RecordType outRecord) throws RemoteException {
		try {
			Transformer t = factory.newTransformer(new StreamSource(new StringReader(xslt.getReference())));
			t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
			output = new StringWriter();
			t.transform(new StreamSource(new StringReader(record.getReference())), new StreamResult(output));
		} catch(Exception e) {
			log.error("Failed to transform record. Throwing exception.");
			throw new RemoteException(e.toString());
		}
	}

	public void transform(String record, Templates xslt) throws RemoteException {
		try {
			Transformer t = xslt.newTransformer();
			t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
			output = new StringWriter();
			t.transform(new StreamSource(new StringReader(record)), new StreamResult(output));
		} catch(Exception e) {
			log.error("Failed to transform record. Throwing exception.");
			throw new RemoteException(e.toString());
		}
	}
	
	public static Templates compileXSLT(String xslt) throws TransformerConfigurationException {
		return factory.newTemplates(new StreamSource(new StringReader(xslt)));
	}
	
	public String getOutput() {
		return output.toString();
	}	
}

The only transformation method that can be used externally (when this program is called by a transformation program) is 'public void transform(RecordType record, VariableType xslt, RecordType outRecord)'. The other 'transform' method as well as the 'compileXSLT' method are intended to be used internally by other programs which call GXSLT_Rec2Rec during their execution.

This is the XML definition of the transformation program:

	 
<?xml version="1.0" encoding="UTF-8"?>	 
<TransformationProgram>
	<Input name="TPInput">
		<Schema isVariable="true" />
		<Language isVariable="true" />
		<Type>record</Type>
		<Reference isVariable="true" />
	</Input>
	<Variable name="XSLT" />
	<Output name="TPOutput">
		<Schema isVariable="true" />
		<Language isVariable="true" />
		<Type>record</Type>
	</Output>
	<TransformationRule>
		<Definition>
		<Transformer>org.diligentproject.metadatamanagement.metadatabrokerlibrary.programs.GXSLT_Rec2Rec.GXSLT_Rec2Rec</Transformer>
		<Input name="Rule1Input1">
			<Schema isVariable="true"> //Input[@name='TPInput']/Schema </Schema>
			<Language isVariable="true"> //Input[@name='TPInput']/Language </Language>
			<Type>record</Type>
			<Reference isVariable="true"> //Input[@name='TPInput']/Reference </Reference>
		</Input>
		<Input name="Rule1Input2">
			<Schema />
			<Language />
			<Type>variable</Type>
			<Reference isVariable="true"> //Variable[@name='XSLT'] </Reference>
		</Input>
		<Output name="TPRule1Output">
			<Reference>//Output[@name='TPOutput']</Reference>
		</Output>
		</Definition>
	</TransformationRule>
</TransformationProgram>

In this example, the transformation program defined above is stored in the DIS as a profile with UniqueID=ce6b9860-ebfe-11db-8b69-dd428ed9686d. The input record to be transformed is stored in a local file named input.xml, and the XSLT to apply is stored in the file xslt.xml. The following code fragment reads the contents of these two files, builds a parameter set that assigns the input data and the XSLT definition to the corresponding variables of the transformation program, and then invokes the transform operation of the metadata broker service. The result is written to the console. The URI of the remote service is given as a command-line argument.

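import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
// Imports for the WS-Core addressing classes, the generated service stubs and
// TransformationParameters are omitted here.

public class Client {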
	public static void main(String[] args) {
		try {
			// Get the broker service porttype
			EndpointReferenceType endpoint = new EndpointReferenceType();
			endpoint.setAddress(new Address(args[0]));
			MetadataBrokerPortType broker = new MetadataBrokerServiceAddressingLocator().getMetadataBrokerPortTypePort(endpoint);

			// Read the input data file into a string
			String inputData = readTextFile("input.xml");

			// Read the XSLT file into a string
			String XSLTDefinition = readTextFile("xslt.xml");
				
			// Create a set of transformation parameters, assigning values to variables
			// defined in the transformation program
			TransformationParameters tparams = TransformationParameters.newInstance();
			tparams.addParameter("//Input[@name='TPInput']/Schema", "Schema1=URI1");
			tparams.addParameter("//Input[@name='TPInput']/Language", "en");
			tparams.addParameter("//Input[@name='TPInput']/Reference", inputData);
			tparams.addParameter("//Output[@name='TPOutput']/Schema", "Schema2=URI2");
			tparams.addParameter("//Output[@name='TPOutput']/Language", "en");
			tparams.addParameter("//Variable[@name='XSLT']", XSLTDefinition);

			// Prepare the invocation parameters
			TransformWithNewTP params = new TransformWithNewTP();
			params.setTransformationProgramID("ce6b9860-ebfe-11db-8b69-dd428ed9686d");
			params.setParameters(tparams.getAsString());

			// Invoke the remote operation and write the result to the console
			System.out.println(broker.transform(params));
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
		 
	private static String readTextFile(String filename) throws IOException {
		BufferedReader br = new BufferedReader(new FileReader(filename));
		StringBuffer buf = new StringBuffer();
		String tmp;
		while ((tmp = br.readLine()) != null) {
			buf.append(tmp + "\n");
		} 
		br.close();
		return buf.toString();
	}
}

-- Sboutsis 19:11, 19 March 2007 (EET)