Harvesting and Indexing Toolkit (HIT) Demo

Uploaded on Feb 3, 2010

A demo of the Harvesting and Indexing Toolkit (HIT - http://code.google.com/p/gbif-indexingtoolkit/) developed by the Global Biodiversity Information Facility (GBIF - www.gbif.org).

The demo shows how a user can harvest and index a dataset in Darwin Core Archive format (DwC-Archive - http://rs.tdwg.org/dwc/terms/guides/text/index.htm) that has been published by an Integrated Publishing Toolkit (IPT - http://code.google.com/p/gbif-providertoolkit/), and how that information could then be accessed and displayed in a portal. The example portal used in the demo is GBIF's own (http://data.gbif.org). Please note that the tool can also harvest datasets published via three other data publishing protocols/toolkits: DiGIR (http://digir.net/), BioCASE (http://www.biocase.org/products/provider_software/index.shtml), and TAPIR (http://www.tdwg.org/dav/subgroups/tapir/1.0/docs/tdwg_tapir_specification_2009-09-08.htm).
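
As a rough illustration of what a Darwin Core Archive looks like to a harvester, the Python sketch below builds a tiny archive in memory and reads it back. It assumes the usual layout described in the Darwin Core text guide (a zip file holding a meta.xml descriptor plus a delimited data file). The file names, the sample records, and the function names build_sample_archive and read_core_rows are made up for illustration and are not taken from the HIT.

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"

# Illustrative descriptor: one core data file with an id and two term columns.
META_XML = r"""<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n"
        ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/country"/>
  </core>
</archive>
"""


def build_sample_archive():
    """Create a small archive in memory (sample data, for illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("meta.xml", META_XML)
        zf.writestr(
            "occurrence.txt",
            "id\tscientificName\tcountry\n"
            "1\tPuma concolor\tArgentina\n"
            "2\tQuercus robur\tDenmark\n",
        )
    return buf.getvalue()


def read_core_rows(archive_bytes):
    """Return the core rows as dicts keyed by the short term name."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        core = ET.fromstring(zf.read("meta.xml")).find(DWC_TEXT_NS + "core")
        location = core.find(f"{DWC_TEXT_NS}files/{DWC_TEXT_NS}location").text
        # meta.xml spells the tab delimiter as the two characters "\t".
        delimiter = core.attrib["fieldsTerminatedBy"].replace("\\t", "\t")
        skip = int(core.attrib.get("ignoreHeaderLines", "0"))
        # Map column index -> last path segment of the term URI.
        columns = {int(core.find(DWC_TEXT_NS + "id").attrib["index"]): "id"}
        for field in core.findall(DWC_TEXT_NS + "field"):
            name = field.attrib["term"].rsplit("/", 1)[-1]
            columns[int(field.attrib["index"])] = name
        text = zf.read(location).decode(core.attrib.get("encoding", "UTF-8"))
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))[skip:]
    return [{name: row[i] for i, name in columns.items()} for row in rows]


if __name__ == "__main__":
    for row in read_core_rows(build_sample_archive()):
        print(row)
```

The sketch only handles a single core file with explicitly indexed fields; real archives can also carry extension files and default values, which are ignored here.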

To understand the demo:

The demo is divided into 2 parts:

Part 1: Harvesting and Indexing a dataset using the HIT

subpart 1: Synchronise with GBIF Registry - During this operation, the user synchronises with the UDDI registry and collects information about all the available dataset access points.
subpart 2: Collect Resource Metadata - During this operation, the user chooses a dataset, and performs a metadata update where metadata about the dataset is collected (dataset name, contact information, etc). You can see a quick shot of the IPT, and where the dataset's access point (URI) actually comes from.
subpart 3: Harvest Data Records - During this operation, the user schedules the harvesting to take place. In the case of DwC-Archive datasets, this consists of two steps: 'Download' and 'Process harvested records'. The result of this operation is an intermediary text file called harvested.txt that contains only those fields of information that are of interest to the user. You can see a quick shot of a mapping file that determines these fields of interest. Please see the project documentation for more information on tailoring the HIT to your own customised index or database structure; otherwise, note that the fields of interest are pre-set and cannot be configured directly from the interface in this release (0.9RC).
subpart 4: View Harvesting Statistics - Here the user can evaluate how successful harvesting has been by comparing the number of records harvested against the target number expected. If the user is satisfied with the result, they will likely move on to the next step; otherwise they need to go back and evaluate where problems occurred.
subpart 5: Index Records with Database - In this operation, the intermediary text file from harvesting is synchronised with the database. For all types of datasets (not just DwC-Archive) this step consists of two parts: 'Synchronise' and 'Extract'. During the 'Synchronise' operation, the text file is read, and the data is mapped to the GBIF Data Structure and written into the appropriate database tables. During the 'Extract' operation, most validation of the records takes place and each record's taxonomy is built in accordance with the GBIF Nub Taxonomy. The result of this step is that the information has been indexed in the database.
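
The harvesting and statistics steps above can be pictured with a small Python sketch. It is not the HIT's implementation: the field list, the tab-separated layout, and the function names write_harvested and harvest_summary are assumptions made for illustration. The description only states that harvested.txt keeps the fields of interest determined by a mapping file, and that the number of harvested records is compared with a target number.

```python
import csv
import io

# Hypothetical fields of interest; in the HIT these are set by a mapping file.
FIELDS_OF_INTEREST = ["scientificName", "country"]


def write_harvested(records, out, fields=FIELDS_OF_INTEREST):
    """Write only the fields of interest, tab-separated; return rows written."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    count = 0
    for record in records:
        # Fields missing from a record are written as empty strings.
        writer.writerow([record.get(name, "") for name in fields])
        count += 1
    return count


def harvest_summary(harvested, target):
    """Compare the harvested record count against the expected target."""
    return {
        "harvested": harvested,
        "target": target,
        "missing": max(target - harvested, 0),
        "complete": harvested >= target,
    }


if __name__ == "__main__":
    # Sample records, invented for this sketch.
    sample = [
        {"id": "1", "scientificName": "Puma concolor", "country": "Argentina"},
        {"id": "2", "scientificName": "Quercus robur"},
    ]
    out = io.StringIO()
    written = write_harvested(sample, out)
    print(out.getvalue(), end="")
    print(harvest_summary(written, target=3))
```

A shortfall in the summary corresponds to the case where the user would go back and check where harvesting problems occurred before indexing.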




Part 2: Accessing the indexed dataset

Here it's possible to see how the information that has been indexed is now accessible via a portal. In this example, the GBIF Data Portal (sitting on top of the index database) is used to illustrate how the information has been indexed. The user can search for the dataset by the name of its publishing organisation, see the dataset amongst the organisation's other datasets, and then narrow in on that dataset to see additional metadata about it. Not shown is how the user could then have viewed the actual data records, downloaded them, etc.
