The `DatasetUtils` object provides methods for flattening (unnesting) Spark's recursive `StructType` instances.
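To illustrate the idea of flattening, here is a minimal sketch in plain Scala. The `DataType`/`StructField`/`StructType` definitions below are simplified stand-ins for Spark's classes, and `flatten` is a hypothetical helper, not the actual `DatasetUtils` API; it only shows how a recursive schema can be unnested into a flat list of dot-separated column names.

```scala
// Simplified stand-ins for Spark's schema classes (not the real API).
sealed trait DataType
case object StringType extends DataType
case object DoubleType extends DataType
case class StructField(name: String, dataType: DataType)
case class StructType(fields: Seq[StructField]) extends DataType

// Recursively replace each struct-typed field with its leaf fields,
// joining names with '.' in the style Spark uses for nested columns.
def flatten(schema: StructType, prefix: String = ""): Seq[StructField] =
  schema.fields.flatMap { f =>
    val name = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
    f.dataType match {
      case st: StructType => flatten(st, name)
      case leaf           => Seq(StructField(name, leaf))
    }
  }

val nested = StructType(Seq(
  StructField("s", StringType),
  StructField("o", StructType(Seq(
    StructField("value", StringType),
    StructField("score", DoubleType))))))

flatten(nested).map(_.name)  // Seq("s", "o.value", "o.score")
```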
The partitioning system is designed to support extensible partitioning of RDF data.
The following entities are involved:

- The method for partitioning an `RDD[Triple]` is located in `RdfPartitionUtilsSpark`. It uses an `RdfPartitioner`, which maps a `Triple` to a single `RdfPartition` instance.
- `RdfPartition`, as the name suggests, represents a partition of the RDF data and defines two methods:
  - `matches(triple: Triple): Boolean`: tests whether a triple fits into the partition.
  - `layout: TripleLayout`: returns the `TripleLayout` associated with the partition, as explained below.
- Furthermore, `RdfPartition`s are expected to be serializable and to define `equals` and `hashCode`.
- `TripleLayout` instances are used to obtain framework-agnostic, compact tabular representations of triples according to a partition. For this purpose it defines two methods:
  - `fromTriple(triple: Triple): Product`: for a given triple, returns its representation as a `Product` (the superclass of all Scala `Tuple`s).
  - `schema: Type`: returns the exact Scala type of the objects returned by `fromTriple`, such as `typeOf[Tuple2[String, Double]]`. Hence, layouts are expected to only yield instances of one specific type.
- See the available layouts for details.
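The entities above can be sketched as follows. This is an illustrative sketch only: `Triple` is a minimal stand-in for Jena's `Triple`, and `StringStringLayout`/`PredicatePartition` are hypothetical examples, not definitions from the library. The traits mirror the methods described above; using a `case class` for the partition also gives the `equals`/`hashCode` and serializability that partitions are expected to provide.

```scala
import scala.reflect.runtime.universe._

// Minimal stand-in for Jena's Triple (hypothetical, for illustration only).
case class Triple(s: String, p: String, o: String)

// Framework-agnostic tabular representation of triples.
trait TripleLayout {
  def fromTriple(triple: Triple): Product // compact row for one triple
  def schema: Type                        // exact Scala type of those rows
}

// A partition of the RDF data.
trait RdfPartition extends Serializable {
  def matches(triple: Triple): Boolean // does the triple fit this partition?
  def layout: TripleLayout
}

// Example layout: each triple becomes a (subject, object) pair,
// so schema is always Tuple2[String, String].
object StringStringLayout extends TripleLayout {
  def fromTriple(t: Triple): Product = (t.s, t.o)
  def schema: Type = typeOf[(String, String)]
}

// Example partition: one partition per predicate; the case class
// supplies equals/hashCode as required of RdfPartitions.
case class PredicatePartition(predicate: String) extends RdfPartition {
  def matches(t: Triple): Boolean = t.p == predicate
  def layout: TripleLayout = StringStringLayout
}

val part = PredicatePartition("http://xmlns.com/foaf/0.1/name")
val t = Triple("http://example.org/alice", "http://xmlns.com/foaf/0.1/name", "Alice")
part.matches(t)           // true
part.layout.fromTriple(t) // ("http://example.org/alice", "Alice")
```

An `RdfPartitioner` would then map each incoming triple to the `RdfPartition` it matches, and the partition's layout turns the triples of that partition into rows of one fixed tuple type.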