Open Source Tools for Creating Mashups with Government Datasets MOSC2010
Mohammed Firdaus, Muhd Sharuzzamal Bakri
Malaysia Open Source Conference 2010
Archive and slide
http://www.mosc.my/2010/08/open-source-tools-for-creating-mashups.html
http://www.slideshare.net/linuxmalaysia/open-source-tools-for-creating-mashups-with-government-datasets-mosc2010
http://www.scribd.com/doc/34006841/Mosc2010-Mashups-With-Government-Datasets
MOSC2010 - Presentation Transcript
Open Source Tools for Creating Mashups with Government Datasets
Mohammed Firdaus, Muhd Sharuzzamal Bakri, June 29, 2010
Introduction - About the Speakers
Mohammed Firdaus bin Mohammed Ab Halim (@firdaus halim) and Muhd Sharuzzamal Bakri (@amai)
Founders of Persada Terbilang Sdn Bhd. We have no relationship whatsoever to any fertilizer supplier.
Introduction - What are Mashups?
A mashup is a web page or application that uses and combines data, presentation or functionality from two or more sources to create new services. (Source: Wikipedia)
Data mashups combine similar types of media and information from multiple sources into a single representation. (Source: Wikipedia)
Challenges - Data Sets are Not Available in Machine Readable Form
Nothing useful here:
    filetype:csv site:.gov.my
    filetype:xml site:.gov.my
    filetype:rdf site:.gov.my
We have to resort to web scraping.
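As a rough illustration of what scraping a results table involves, here is a minimal sketch that uses only the Python standard library. The HTML snippet is hypothetical (the slides do not show the real page markup), and this is not the scraper the speakers published.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical markup; the real page layout is an assumption.
SAMPLE_HTML = """
<table>
  <tr><th>Tajuk Tender</th><th>Nombor Tender</th><th>Kementerian</th></tr>
  <tr><td>Example tender title</td><td>8/2009</td><td>KEMENTERIAN KESIHATAN</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows
rows_as_dicts = [dict(zip(header, r)) for r in records]
print(rows_as_dicts)
```

Real pages are rarely this tidy (nested tables, pagination, inconsistent cells), so a production scraper would need more defensive handling than this sketch shows.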
Challenges - No Data Dictionaries
Since the data sets that are available were meant for humans to consume rather than machines, they are usually published without any type of data dictionary.
This means that an application developer will have to make assumptions about the structure of each field, e.g. whether it's unique, whether it's a multi-value field, which fields are mandatory/optional.
These assumptions may or may not turn out to be correct as you see more and more data in the data set.
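One way to test such assumptions as more data arrives is to profile each column. The sketch below is an illustration only: the column names and the rows in `SAMPLE_CSV` are hypothetical and are not taken from the real dump.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the spirit of the scraped dump; not real records.
SAMPLE_CSV = """tender_no,title,category
8/2009,Supply of item A,Supplies
8/2009,Supply of item A,Supplies
,Repair works B,
128/2009,Services C,Services
"""


def profile(rows, field):
    """Summarise one column: how many values are missing, distinct, repeated."""
    values = [(row.get(field) or "").strip() for row in rows]
    present = [v for v in values if v]
    counts = Counter(present)
    return {
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "repeated": sorted(v for v, n in counts.items() if n > 1),
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for field in ("tender_no", "category"):
    print(field, profile(rows, field))
```

A non-empty `repeated` list means the "this field is unique" assumption does not hold, and a non-zero `missing` count means the field cannot be treated as mandatory.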
Challenges - New Data Sets Constantly Become Available
This is not a bad thing. However, our code, database and schema must be flexible enough to deal with future data sets that we might want to use in our applications.
Challenges - Lack of Standards Across Agencies
Different agencies use different identifiers to refer to the same entity.
The lack of common identifiers makes it tedious to combine data sets that may be describing the same entity.
MyCoID and MyID are steps in the right direction.
Challenges - Summary
Because of these challenges, we need an agile method for modeling, storing and processing these government datasets in our application.
The purpose of this presentation is to show how representing your data as a graph can both help you deal with these challenges and help you make compelling data mashups.
Graphs - Introduction to Graphs - What is a Graph?
A data structure that consists of a collection of vertices and the connections between those vertices, called edges.
Vertices are sometimes called nodes or dots. Edges are sometimes called relationships, links or arcs.
The terminology differs between software packages.
Graphs - Types of Graphs
A directed graph (or digraph) is one where the edges have a direction (i.e. there's an outgoing and an incoming vertex).
A multigraph is one where multiple edges can exist between two vertices.
An edge-labeled graph is a graph where edges have labels. Similarly, a vertex-labeled graph is one in which the vertices have labels.
An attributed graph is one in which the vertices and edges can have attributes (key-value pairs).
A graph can have more than one of these properties, e.g. a multi-digraph is one in which multiple directed edges can exist between two vertices.
Graphs - Types of Graphs - Examples - Social Graphs
Source: http://www.flickr.com/photos/greenem/11696663/
Undirected graph: vertices represent people and edges represent friendships.
Graphs - Types of Graphs - Examples - Web Graph
Source: http://en.wikipedia.org/wiki/File:WorldWideWebAroundWikipedia.png
Multi-digraph: vertices represent web pages and directed edges represent links between pages.
Graphs - Property Graphs
'Property graph' is another term for an attributed, labeled multi-digraph.
Property graphs are flexible enough to support most types of graph data. Other types of graphs (with the exception of hypergraphs) can be built on top of property graphs by removing features or using features of the property graph in certain ways.
The tools that we are covering in this presentation deal primarily with property graphs.
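To make the definition concrete, here is a minimal in-memory sketch of an attributed, edge-labeled multi-digraph. The class and method names are assumptions made for illustration; they are not the API of Neo4j, Gremlin or any other tool mentioned in this talk.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Vertex:
    id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    id: int
    label: str          # edge label, e.g. "links_to"
    out_v: Vertex       # vertex the edge leaves (tail)
    in_v: Vertex        # vertex the edge points at (head)
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Attributed, edge-labeled multi-digraph held in plain lists."""

    def __init__(self):
        self.vertices = []
        self.edges = []
        self._ids = count(1)

    def add_vertex(self, **properties):
        v = Vertex(next(self._ids), properties)
        self.vertices.append(v)
        return v

    def add_edge(self, out_v, label, in_v, **properties):
        # No uniqueness check: parallel edges are allowed (multigraph).
        e = Edge(next(self._ids), label, out_v, in_v, properties)
        self.edges.append(e)
        return e

    def out_edges(self, v, label=None):
        return [e for e in self.edges
                if e.out_v is v and label in (None, e.label)]

    def in_edges(self, v, label=None):
        return [e for e in self.edges
                if e.in_v is v and label in (None, e.label)]


g = PropertyGraph()
a = g.add_vertex(name="A")
b = g.add_vertex(name="B")
g.add_edge(a, "links_to", b, weight=1)
g.add_edge(a, "links_to", b, weight=2)   # second, parallel edge
print(len(g.out_edges(a, "links_to")))   # 2
```

Dropping the `properties` dicts gives a plain labeled multi-digraph, and refusing parallel edges gives a simple digraph, which is the sense in which other graph types can be built by removing features.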
Graphs - Property Graphs
Source: http://wiki.github.com/tinkerpop/gremlin/defining-a-property-graph
Data Sets - Treasury Procurement Data - Treasury - Tenders Awarded
Source: http://myprocurement.treasury.gov.my/index.php/en/list-keputusan-tender
Data Sets - Treasury Procurement Data - Fields
Tajuk Tender (Title of Tender)
Nombor Tender (Tender Number)
Kategori Perolehan (Procurement Category)
Kementerian (Ministry)
Petender Berjaya (Winner of Tender)
No Pendaftaran Dengan ROB/ROS/ROC (Registration Number with ROB/ROS/ROC)
No Pendaftaran Dengan MOF/PKK (Registration Number with MOF/PKK)
Harga Setuju Terima (Agreed Upon Value)
Data Sets - Treasury Procurement Data - Code and Data in Machine Readable Form
For this presentation we are using data that we scraped from this site on 2010-04-26.
The source code for our scraper and the CSV dump from 2010-04-26 are available at http://mfirdaus.com/mosc-paper/
The dump contains 2615 records.
Data Sets - Treasury Procurement Data - The Dump
Data Sets - Issues with this Data Set - Missing Fields
Out of the 2615 records in the dump:
510 records were missing a tender number
472 records were missing a category
1836 records were missing a ROB/ROS/ROC number
510 records were missing a MOF number
Data Sets - Issues with this Data Set - Tender Numbers are Not Unique
32 records have the same tender number and title as another record.
23 records have the same tender number as another record.
In some cases these appear to be duplicate records, since all the fields match.
In other cases, one or two fields are slightly different, indicating that there was probably a typo (and the erroneous record was not deleted).
In some cases, the other fields are completely different, which leads us to think that it's possible for a tender to have multiple winners (we need some government officials to verify this for us).
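The three cases described above can be told apart mechanically. The following sketch groups records by tender number and labels each group; the records and field names are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical records; field names and values are illustrative only.
records = [
    {"tender_no": "1/2009", "title": "T1", "winner": "ALPHA SDN BHD"},
    {"tender_no": "1/2009", "title": "T1", "winner": "ALPHA SDN BHD"},
    {"tender_no": "2/2009", "title": "T2", "winner": "BETA SDN BHD"},
    {"tender_no": "2/2009", "title": "T2", "winner": "GAMMA SDN BHD"},
    {"tender_no": "3/2009", "title": "T3", "winner": "DELTA SDN BHD"},
]


def group_by_tender_no(rows):
    groups = defaultdict(list)
    for row in rows:
        if row["tender_no"]:            # skip records with no tender number
            groups[row["tender_no"]].append(row)
    return groups


def classify(group):
    """Label a group of records that share one tender number."""
    if len(group) == 1:
        return "unique"
    first = group[0]
    if all(row == first for row in group[1:]):
        return "exact duplicate"
    # Fields whose values disagree somewhere in the group.
    differing = [k for k in first if len({row[k] for row in group}) > 1]
    return "differs in: " + ", ".join(differing)


for tender_no, group in group_by_tender_no(records).items():
    print(tender_no, "->", classify(group))
```

Whether a "differs in: winner" group is a typo or a genuine multi-winner tender still needs a human decision; the code only narrows down which records to look at.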
Data Sets - Issues with this Data Set - Format of Tender Numbers
Examples of tender numbers:
    8/2009
    PL.(T).08.2009(JKP)
    X0141110101090021
    128/2009
    KBS.S.4-14/69 (T.26/2009)
Probably not a good idea to write code that attempts to parse the tender number.
Data Sets - Issues with this Data Set - Format of the "Petender Berjaya" Field
Examples of values found in this field:
    SYARIKAT PROSPECTRUM SDN BHD
    TELEKOM SMART SCHOOL SDN BHD NO.45-8, LEVEL 3, BLOCK C, PLAZA DAMANSARA, JALAN MEDAN SETIA 1, BUKIT DAMANSARA 50490 KUALA LUMPUR
    1. GLOBAL AEROSPACE SDN BHD (A002) 2. SYSTEM ALLIANCE TECHNOLOGY SDN. BHD.(A003) 3. KARISMA WIRA SDN. BHD. (A004) 4. KESUMA TECHNOLOGY SDN. BHD (A005)
    A QUALITY REPUTATION SDN BHD B PRIMABUMI SDN BHD
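A heuristic cleanup of this field might look like the sketch below. The rules (numbered markers, trailing codes in parentheses, cutting at the company suffix) are assumptions inferred from the examples on this slide, not the speakers' own code, and the lettered "A ... B ..." form is deliberately left unhandled because a leading letter is ambiguous with a company name.

```python
import re

# Marker such as "1. " at the start of the text or after whitespace.
NUMBERED = re.compile(r"(?:^|\s)\d+\.\s+")
# Trailing code in parentheses, e.g. "(A002)".
TRAILING_CODE = re.compile(r"\s*\([A-Z]\d+\)\s*$")
# Shortest prefix ending in a company suffix; anything after it (for
# example a postal address) is dropped.
NAME_PREFIX = re.compile(r"^(.*?\b(?:SDN\.?\s*BHD|BERHAD|BHD)\b\.?)")


def split_winners(value):
    """Split a 'Petender Berjaya' value into a list of company names."""
    parts = [p for p in NUMBERED.split(value.strip()) if p.strip()]
    names = []
    for part in parts:
        part = TRAILING_CODE.sub("", part).strip()
        match = NAME_PREFIX.match(part)
        name = match.group(1) if match else part
        # Normalise "SDN. BHD." and "SDN BHD" to a single spelling.
        name = re.sub(r"SDN\.?\s*BHD\.?", "SDN BHD", name)
        names.append(" ".join(name.split()))
    return names


print(split_winners("SYARIKAT PROSPECTRUM SDN BHD"))
print(split_winners(
    "TELEKOM SMART SCHOOL SDN BHD NO.45-8, LEVEL 3, BLOCK C, PLAZA DAMANSARA, "
    "JALAN MEDAN SETIA 1, BUKIT DAMANSARA 50490 KUALA LUMPUR"))
print(split_winners(
    "1. GLOBAL AEROSPACE SDN BHD (A002) 2. SYSTEM ALLIANCE TECHNOLOGY SDN. BHD.(A003) "
    "3. KARISMA WIRA SDN. BHD. (A004) 4. KESUMA TECHNOLOGY SDN. BHD (A005)"))
```

Normalising the spelling of "SDN BHD" matters later on, because the company name ends up being the key that joins records from different data sets.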
Data Sets - Modeling - Modeling this Data Set as a Property Graph
One way to model this data as a graph is to use:
Vertices to represent tenders, ministries and companies/businesses.
An "awarded by" labeled edge to associate a tender with a ministry.
An "awarded to" labeled edge to associate a tender with the winner of the tender (the company/business).
Attributes on tender vertices for the tender title, number, value and category.
Attributes on company/business vertices for the company/business name, ROB/ROC/ROS registration number and MOF registration number.
Attributes on ministry vertices for the name of the ministry.
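The mapping from one scraped record to vertices and edges can be sketched with plain dictionaries and tuples. The record below is invented, and the key scheme (row number for tenders, name for ministries and companies) is an assumption chosen because tender numbers are not unique in the dump.

```python
# Hypothetical record; keys mirror the scraped fields, values are invented.
record = {
    "title": "Example tender title",
    "tender_no": "8/2009",
    "category": "Supplies",
    "ministry": "KEMENTERIAN KESIHATAN",
    "winner": "EXAMPLE SDN BHD",
    "reg_no": "123456-X",
    "mof_no": "",
    "value": "100000.00",
}

vertices = {}   # key -> attribute dict
edges = []      # (tail key, label, head key)


def upsert_vertex(key, **attrs):
    """Create the vertex on first sight; later calls only fill in gaps."""
    vertex = vertices.setdefault(key, {})
    for name, value in attrs.items():
        if value and not vertex.get(name):
            vertex[name] = value
    return key


def add_record(rec, row_id):
    # Tender numbers are not unique in the dump, so key tenders by row.
    tender = upsert_vertex(("tender", row_id), type="tender",
                           no=rec["tender_no"], title=rec["title"],
                           category=rec["category"], value=rec["value"])
    ministry = upsert_vertex(("ministry", rec["ministry"]),
                             type="ministry", name=rec["ministry"])
    company = upsert_vertex(("business entity", rec["winner"]),
                            type="business entity", name=rec["winner"],
                            no=rec["reg_no"], mof_no=rec["mof_no"])
    edges.append((tender, "awarded_by", ministry))
    edges.append((tender, "awarded_to", company))


add_record(record, row_id=0)
add_record(dict(record, title="Second example", tender_no="9/2009"), row_id=1)
print(len(vertices), len(edges))
```

Because missing values are simply not stored as attributes, records with gaps (no MOF number, no category) need no special schema handling.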
Data Sets - Modeling - Example
Graph Databases and Neo4j - Neo4j - Introduction
Neo4j is a graph database: it persists data in graph form.
It uses the property graph data model, with the exception of vertex labels.
In Neo4j terms, vertices are nodes, edges are relationships and attributes are properties.
Property values can be a String or any Java primitive (arrays of these types are supported as well).
Licensed under the AGPLv3, which basically means that you don't need a commercial license if your application is released under a compatible free software license. For other uses, you need a commercial license from its developers.
Graph Databases and Neo4j - Neo4j - Introduction
Written in Java.
Bindings available for Python, Ruby, Clojure, Erlang, Groovy, Scala and PHP. We will be using the Python bindings in this talk.
An embedded database, meaning that it runs in the same process space as the application.
There's a standalone REST server for those who prefer it.
Graph Databases and Neo4j - Inserting into Neo4j - Initializing the Database
    import neo4j
    db = neo4j.GraphDatabase("db")
Graph Databases and Neo4j - Inserting into Neo4j - Creating the Nodes
    ministry_node = db.node(name=ministry, type="ministry")
    entity_node = db.node(name=entity_name, no=entity_no,
                          mof_no=entity_mof_no, type="business entity")
    tender_node = db.node(no=tender_no, title=tender_title,
                          category=tender_category, value=tender_value,
                          type="tender")
Graph Databases and Neo4j - Inserting into Neo4j - Creating the Relationships
    tender_node.awarded_by(ministry_node)
    tender_node.awarded_to(entity_node)
Graph Databases and Neo4j - Inserting into Neo4j - Indexing Nodes
    ministries = db.index("ministries", create=True)
    business_entities = db.index("business entities", create=True)
    tenders_by_no = db.index("tenders by no", create=True)
    tenders_by_title = db.index("tenders by title", create=True)
    tenders_by_no[tender_no] = tender_node
    tenders_by_title[tender_title] = tender_node
Graph Databases and Neo4j - Inserting into Neo4j - The Result
Graph Traversals - Traversing the Graph
Traversing is the process of walking around the graph.
Graph Traversals - Graph Traversal Options
Graph Traversal Framework
Gremlin
SPARQL
Manual traversal
Graph Traversals - Problem
Let's use graph traversal to find all the companies that have been awarded contracts by Kementerian Kesihatan.
Graph Traversals - Graph Around Kementerian Kesihatan
Graph Traversals - Traversal Framework - Defining the Traversal
    # Companies that have gotten contracts from a particular ministry.
    # The start node is a ministry.
    class Contractors(neo4j.Traversal):
        types = [neo4j.Incoming.awarded_by, neo4j.Outgoing.awarded_to]
        order = neo4j.DEPTH_FIRST
        stop = neo4j.STOP_AT_END_OF_GRAPH

        def isReturnable(self, position):
            # Return only business entity nodes, not the tenders in between.
            return position["type"] == "business entity"
Graph Traversals - Traversal Framework - Using the Traversal
    with db.transaction:
        moh = ministries["KEMENTERIAN KESIHATAN"]
        contractors = Contractors(moh)
        for c in contractors:
            print(c["name"])
Graph Traversals - Traversal Framework - Output
    RAF SYNERGY SDN BHD
    PRIMABUMI SDN BHD
    AVERROES PHARMACEUTICALS SDN BHD
    QUALITY REPUTATION SDN BHD
    UNISENDO SDN BHD
    PRESTIGE PHARMA SDN BHD
    PHARMANIAGA LOGISTICS SDN BHD
    IDAMAN PHARMA SDN BHD
    PHARMASERV ALLIANCES SDN BHD
Graph Traversals - Traversing Graphs with Gremlin - Gremlin
Gremlin is a graph-based programming language.
It can express complex graph traversals concisely.
Available at http://wiki.github.com/tinkerpop/gremlin/
Graph Traversals - Traversing Graphs with Gremlin - Traversing the Graph with Gremlin
    $ ./gremlin.sh
             \,,,/
             (o o)
    -----oOOo-(_)-oOOo-----
    gremlin> $_ := g:key("ministries", "KEMENTERIAN KESIHATAN")
    ==>v[66]
    gremlin> ./inE[@label="awarded_by"]/outV/outE[@label="awarded_to"]/inV/@name
    ==>PHARMASERV ALLIANCES SDN BHD
    ==>IDAMAN PHARMA SDN BHD
    ==>PHARMANIAGA LOGISTICS SDN BHD
    ==>PRIMABUMI SDN BHD
    ==>PRESTIGE PHARMA SDN BHD
    ==>UNISENDO SDN BHD
    ==>PRIMABUMI SDN BHD
    ==>QUALITY REPUTATION SDN BHD
    ==>AVERROES PHARMACEUTICALS SDN BHD
    ==>PRIMABUMI SDN BHD
    .....
Graph Traversals - Traversing Graphs with Gremlin - Explanation
    ./inE[@label="awarded_by"]/outV/outE[@label="awarded_to"]/inV/@name
inE - incoming edges
outV - outgoing vertices (the vertices that the edges leave)
outE - outgoing edges
inV - incoming vertices (the vertices that the edges point to)
Step by step:
Get the current object (.), which is the 'KEMENTERIAN KESIHATAN' node.
Get the incoming edges labeled "awarded_by" (inE[@label="awarded_by"]).
Get the outgoing vertices of those edges (outV), which are the contract nodes, i.e. the tenders.
Get the outgoing "awarded_to" edges of the contract nodes (outE[@label="awarded_to"]).
Get the incoming vertices of those edges (inV), which are the business entity vertices.
Get the name attributes of those vertices (@name).
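For readers who do not know Gremlin, the same walk can be written out by hand in Python over a toy edge list. The tender ids, "SOME OTHER MINISTRY" and "EXAMPLE SDN BHD" are made up for the illustration, and the helper names are assumptions rather than any library's API.

```python
# Toy graph as (tail, label, head) triples; the tender ids are made up.
edges = [
    ("tender-1", "awarded_by", "KEMENTERIAN KESIHATAN"),
    ("tender-1", "awarded_to", "PRIMABUMI SDN BHD"),
    ("tender-2", "awarded_by", "KEMENTERIAN KESIHATAN"),
    ("tender-2", "awarded_to", "UNISENDO SDN BHD"),
    ("tender-3", "awarded_by", "KEMENTERIAN KESIHATAN"),
    ("tender-3", "awarded_to", "PRIMABUMI SDN BHD"),
    ("tender-4", "awarded_by", "SOME OTHER MINISTRY"),
    ("tender-4", "awarded_to", "EXAMPLE SDN BHD"),
]


def in_edges(vertex, label):
    return [e for e in edges if e[2] == vertex and e[1] == label]


def out_edges(vertex, label):
    return [e for e in edges if e[0] == vertex and e[1] == label]


def contractors(ministry):
    """Follow awarded_by edges backwards, then awarded_to edges forwards."""
    names = []
    for tender, _, _ in in_edges(ministry, "awarded_by"):      # inE, outV
        for _, _, company in out_edges(tender, "awarded_to"):  # outE, inV
            names.append(company)                              # @name
    return names


found = contractors("KEMENTERIAN KESIHATAN")
print(found)                 # one entry per award, so names can repeat
print(sorted(set(found)))    # de-duplicated
```

As in the Gremlin output, a company appears once per path that reaches it, which is why PRIMABUMI SDN BHD shows up more than once before de-duplication.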
Graph Visualizations - Gephi
Photoshop for graphs.
Supports various graph layout algorithms.
Graph metrics supported: clustering coefficient, PageRank, diameter, betweenness centrality, closeness centrality.
File formats supported: CSV, GraphML, GEXF, etc.
http://www.gephi.org
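Since GraphML is among the formats listed, one way to hand a graph to a visualization tool is to write it out as GraphML. The sketch below uses only the standard library; the vertex ids, the "tender 1" vertex and the choice of `name` and `label` as data keys are assumptions, and how a given tool displays those keys is not something this example guarantees.

```python
import xml.etree.ElementTree as ET

NS = "http://graphml.graphdrawing.org/xmlns"

# Toy data; identifiers and the tender vertex are illustrative only.
vertices = {
    "n1": "KEMENTERIAN KESIHATAN",
    "n2": "tender 1",
    "n3": "PRIMABUMI SDN BHD",
}
edges = [("n2", "awarded_by", "n1"), ("n2", "awarded_to", "n3")]


def to_graphml(vertices, edges):
    root = ET.Element("graphml", xmlns=NS)
    # Declare the data keys used below: a node name and an edge label.
    ET.SubElement(root, "key", {"id": "name", "for": "node",
                                "attr.name": "name", "attr.type": "string"})
    ET.SubElement(root, "key", {"id": "label", "for": "edge",
                                "attr.name": "label", "attr.type": "string"})
    graph = ET.SubElement(root, "graph", edgedefault="directed")
    for vid, name in vertices.items():
        node = ET.SubElement(graph, "node", id=vid)
        ET.SubElement(node, "data", key="name").text = name
    for i, (source, label, target) in enumerate(edges):
        edge = ET.SubElement(graph, "edge", id="e%d" % i,
                             source=source, target=target)
        ET.SubElement(edge, "data", key="label").text = label
    return ET.tostring(root, encoding="unicode")


xml_text = to_graphml(vertices, edges)
print(xml_text)
```

Writing `xml_text` to a file with a `.graphml` extension gives a document that any GraphML-aware tool should be able to open.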
Graph Visualizations - Gephi
Mashing Up - Adding External Data Sources
Let's add shareholding data from Suruhanjaya Syarikat Malaysia (SSM) to the graph so that we can show the tenders that have been awarded to Telekom Malaysia Berhad and any of its subsidiaries/associate companies.
Mashing Up - Adding External Data Sources - Connecting Telekom Malaysia Berhad and Telekom Smart School Sdn Bhd
    telekom = business_entities["TELEKOM MALAYSIA BERHAD"]
    telekom_smart_school = business_entities["TELEKOM SMART SCHOOL SDN BHD"]
    telekom_multi_media = db.node(
        name="TELEKOM MULTI-MEDIA SDN BHD", no="345420-H",
        text="TELEKOM MULTI-MEDIA SDN BHD", type="business entity")
    telekom.shareholder_in(telekom_multi_media, units=1650000)
    telekom_multi_media.shareholder_in(telekom_smart_school, units=7650000)
Mashing Up - Adding External Data Sources - Graph Centered at Telekom Malaysia Berhad
Mashing Up - Adding External Data Sources - Graph Centered at Telekom Smart School Sdn Bhd
Mashing Up - Traversing to Find Direct/Indirect Awards - The Traverser
    class AllTendersDirectIndirect(neo4j.Traversal):
        types = [neo4j.Incoming.awarded_to, neo4j.Outgoing.shareholder_in]
        order = neo4j.DEPTH_FIRST
        stop = neo4j.STOP_AT_END_OF_GRAPH

        def isReturnable(self, position):
            # Return only tender nodes, not the companies along the way.
            return position["type"] == "tender"
Mashing Up - Traversing to Find Direct/Indirect Awards - Executing the Traverser and the Output
Executing the traversal definition:
    telekom = business_entities["TELEKOM MALAYSIA BERHAD"]
    tenders = AllTendersDirectIndirect(telekom)
    for tender in tenders:
        print(tender["no"])
Output:
    30/2009
    35/2009
    8/2009
    162/2009
    JASA/OP/1/2009
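The idea behind this traverser can also be shown without Neo4j. The sketch below walks outgoing "shareholder_in" edges depth first and collects the tenders awarded to every company reached. The company names come from the slides, but which tender belongs to which company, and "UNRELATED SDN BHD", are invented for the illustration.

```python
# Toy graph as (tail, label, head) triples. Company names are from the
# slides; which tender belongs to which company is invented here.
edges = [
    ("TELEKOM MALAYSIA BERHAD", "shareholder_in", "TELEKOM MULTI-MEDIA SDN BHD"),
    ("TELEKOM MULTI-MEDIA SDN BHD", "shareholder_in", "TELEKOM SMART SCHOOL SDN BHD"),
    ("tender A", "awarded_to", "TELEKOM MALAYSIA BERHAD"),
    ("tender B", "awarded_to", "TELEKOM SMART SCHOOL SDN BHD"),
    ("tender C", "awarded_to", "UNRELATED SDN BHD"),
]


def tenders_direct_and_indirect(company):
    """Depth-first walk over outgoing shareholder_in edges, collecting the
    tenders that point at each company reached via awarded_to."""
    seen, stack, tenders = set(), [company], []
    while stack:
        current = stack.pop()
        if current in seen:
            continue                    # guards against cross-shareholding loops
        seen.add(current)
        for tail, label, head in edges:
            if label == "awarded_to" and head == current:
                tenders.append(tail)
            elif label == "shareholder_in" and tail == current:
                stack.append(head)
    return tenders


print(tenders_direct_and_indirect("TELEKOM MALAYSIA BERHAD"))
```

The `seen` set is a design choice for this sketch: shareholding data can contain cycles, and without it the walk might never terminate.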
Wrapup - Making this Easier