When you have only a few applications, it is easy to integrate them using a point-to-point approach, i.e., each application connects directly to all the other applications to access their information. As the number of applications increases, the point-to-point approach degenerates into an unmaintainable mess of spaghetti-like chaos, as shown in the diagram below. You can reduce the integration complexity by either introducing an Enterprise Service Bus (ESB) or by using data virtualization. Note that this is not an either-or choice: ESB and data virtualization are two sides of the same integration coin. An ESB is used for application integration, data virtualization is used for data integration, and they work well together.

Complexity is not the only issue here; consider security administration. With a point-to-point approach, you have to administer security or access control on each data source for each application. Also, many organizations store information, e.g., about a customer, in several different applications, and the information they store may differ slightly from application to application. This means that, depending on the system or application you query, you may end up with conflicting information about a customer.
Data Virtualization to the Rescue
In this article, I shall present the data integration approach. (I shall present the ESB approach in a future article.) Data virtualization can be used to build standards-based data services that present information in a form useful to business applications, and it provides the capabilities to
- combine information contained in multiple data sources into virtual tables in a virtual RDBMS database
- allow your applications to access all your information as if it were contained in a single RDBMS, instead of in different formats depending on the source
- allow you to administer access control centrally on the virtual database instead of on individual data sources
- facilitate exposing a canonical data model for all applications to access, so that all applications get consistent information instead of potentially different information depending on which system they query
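The first two capabilities can be sketched in plain Python and SQLite, with SQLite standing in for JDV and its data sources (no JDV API is involved): two independent databases are combined behind a single SQL view that an application queries as one ordinary table.

```python
import sqlite3

# Two independent "physical data sources" (shared-cache in-memory databases,
# so that a third connection can attach both). In JDV these could be any two
# systems reachable through connectors.
crm = sqlite3.connect("file:crm?mode=memory&cache=shared", uri=True)
crm.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
crm.commit()

billing = sqlite3.connect("file:billing?mode=memory&cache=shared", uri=True)
billing.execute("CREATE TABLE account (customer_id INTEGER, balance REAL)")
billing.executemany("INSERT INTO account VALUES (?, ?)", [(1, 150.0), (2, 75.5)])
billing.commit()

# The "virtual database": attach both sources and expose one combined view.
vdb = sqlite3.connect("file:vdb?mode=memory&cache=shared", uri=True)
vdb.execute("ATTACH DATABASE 'file:crm?mode=memory&cache=shared' AS crm")
vdb.execute("ATTACH DATABASE 'file:billing?mode=memory&cache=shared' AS billing")
vdb.execute("""CREATE TEMP VIEW customer_account AS
               SELECT c.id, c.name, a.balance
               FROM crm.customer c
               JOIN billing.account a ON a.customer_id = c.id""")

# The application sees one ordinary SQL table, not two separate sources.
rows = vdb.execute(
    "SELECT name, balance FROM customer_account ORDER BY id").fetchall()
print(rows)  # [('Alice', 150.0), ('Bob', 75.5)]
```

In JDV, the attached databases would be source models over real systems and `customer_account` would be a view model, queried remotely rather than in-process.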
Data virtualization is different from the ETL (Extract, Transform, Load) approach in the way it works. The ETL approach always involves copying data from one place to another (data duplication); the data virtualization approach does not, and can be used for real-time or near-real-time processing. Data virtualization products such as JBoss Data Virtualization (JDV) also provide a large number of out-of-the-box connectors for different data sources, including various relational databases, NoSQL databases, web services, CSV files, XML files, and SaaS applications such as Salesforce and SAP.
The diagram at the beginning of this article gives a conceptual view of what JDV is. The following diagram depicts how JDV works.
I shall use a top-down approach in the following sections to describe how data virtualization works.
As mentioned earlier, data virtualization allows you to access the combined data or your corporate canonical data model using standard interfaces. In JDV’s case, you may access the information contained in the virtual database using the following interfaces:
- JDBC
- ODBC
- Web services (SOAP/REST) generated automatically by JDV
A JDV virtual database is constructed from models, which combine information from various data sources into consistent, useful information to be used by applications via the Access Layer.
In JDV, models are used to define entities and the relationships between them.
Source models define the structural and data characteristics of the information contained in data sources. Each source model describes a single physical data source.
View models define a layer of abstraction above the physical layer (the source models), so that information can be presented to end users and consuming applications in, e.g., the form of an organization's canonical data model instead of the form in which it is physically stored. You may layer view models on top of other view models to arrive at your canonical data model.
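This layering can be illustrated with plain SQLite views stacked on one another (the table and column names below are invented for the example; this is not JDV modeling syntax): a first view abstracts away the legacy physical shape, and the canonical view is defined on top of it.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# "Source model": the physical shape, with cryptic legacy column names.
db.execute("CREATE TABLE cust_mstr (cst_id INTEGER, cst_nm TEXT, cst_st TEXT)")
db.execute("INSERT INTO cust_mstr VALUES (1, 'alice', 'A')")

# First view layer: abstract away the physical column names.
db.execute("""CREATE VIEW customer_base AS
              SELECT cst_id AS id, cst_nm AS name, cst_st AS status
              FROM cust_mstr""")

# Second view layer: the canonical model, with cleaned-up values.
db.execute("""CREATE VIEW customer AS
              SELECT id,
                     UPPER(SUBSTR(name, 1, 1)) || SUBSTR(name, 2) AS name,
                     CASE status WHEN 'A' THEN 'ACTIVE'
                                 ELSE 'INACTIVE' END AS status
              FROM customer_base""")

row = db.execute("SELECT name, status FROM customer WHERE id = 1").fetchone()
print(row)  # ('Alice', 'ACTIVE')
```

Consuming applications query only the top `customer` view; the legacy shape underneath can change without affecting them.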
Using JDV, you no longer have to administer access control on individual data sources; you administer it centrally on the virtual database. You can control access by role, down to the column level of any virtual table in a virtual database. Instead of simply disallowing READ access to certain columns, you can also specify that those columns be masked out with certain patterns. Using column masking means that you no longer have to maintain different SQL scripts/statements for different user roles. Many customers refer to JDV's security features as a Data Firewall.
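The column-masking idea can be sketched in a few lines of Python. This is purely illustrative logic, not JDV's actual role or masking configuration syntax, and the roles and columns are made up:

```python
MASK = "****"

def mask_row(row, role, policy):
    """Return a copy of `row` with columns hidden from `role` masked out.

    `policy` maps a role name to the set of columns that role may not see
    in the clear; roles absent from the policy see everything.
    """
    hidden = policy.get(role, set())
    return {col: (MASK if col in hidden else val) for col, val in row.items()}

# Hypothetical policy: support staff never see SSNs or card numbers.
policy = {"support": {"ssn", "card_number"}}

row = {"name": "Alice", "ssn": "123-45-6789", "card_number": "4111111111111111"}
print(mask_row(row, "support", policy))  # ssn and card_number replaced by ****
print(mask_row(row, "admin", policy))    # unchanged: admin is not in the policy
```

The point of doing this centrally, as JDV does, is that every application issuing the same query gets the same consistently masked view for its role.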
JDV provides three forms of caching to enhance performance:
- ResultSet Caching – caches result sets based on an exact match of the incoming SQL string and PreparedStatement parameter values, if present. By default, result set caching is enabled with a maximum of 1024 entries and a maximum entry age of two hours.
- Internal Materialization – materialized views are the cached contents of a virtual table. Queries against this virtual table are served from the cached contents instead of from the original source(s). Internal materialized views are cached in memory. You can specify a TTL (Time-To-Live) for individual materialized views; when the TTL expires, JDV refreshes the materialized view.
- External Materialization – the same as internal materialization, except that the materialized views are stored in external RDBMS tables. JDV allows you to use its graphical user interface to generate SQL DDL (Data Definition Language) for the materialized views based on view models. You can then create tables in your external relational database using the generated DDL, and JDV takes care of populating them. Again, TTLs can be specified.
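The result-set caching behaviour described above can be sketched as a small Python model: entries are keyed on the exact SQL string plus parameter values, the cache is bounded in size, and entries expire after a maximum age. This is a conceptual sketch, not JDV's implementation; the defaults below merely mirror the documented 1024-entry/two-hour configuration.

```python
import time
from collections import OrderedDict

class ResultSetCache:
    """Toy result-set cache: exact-match key, bounded size, bounded entry age."""

    def __init__(self, max_entries=1024, max_age_seconds=7200):
        self.max_entries = max_entries
        self.max_age = max_age_seconds
        self._entries = OrderedDict()  # (sql, params) -> (cached_at, rows)

    def get(self, sql, params=()):
        key = (sql, tuple(params))
        entry = self._entries.get(key)
        if entry is None:
            return None                    # no exact match on SQL + params
        cached_at, rows = entry
        if time.time() - cached_at > self.max_age:
            del self._entries[key]         # entry older than the maximum age
            return None
        return rows

    def put(self, sql, params, rows):
        self._entries[(sql, tuple(params))] = (time.time(), rows)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)   # evict the oldest entry

cache = ResultSetCache()
cache.put("SELECT name FROM customer WHERE id = ?", (1,), [("Alice",)])
print(cache.get("SELECT name FROM customer WHERE id = ?", (1,)))  # [('Alice',)]
print(cache.get("SELECT name FROM customer WHERE id = ?", (2,)))  # None
```

Note that a query with different parameter values is a cache miss: the match is exact, not semantic. The same TTL idea drives the materialized-view refresh described above.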
As mentioned earlier, JDV provides a one-click facility to generate web services for view models, offering CRUD (Create, Read, Update, Delete) operations. It generates a .war file which you can deploy on the Java application server JDV runs on (the JBoss Enterprise Application Platform, or EAP, that ships with JDV).
When applications submit queries to a virtual database (VDB) via the access layer, the query engine produces an optimized query plan to provide efficient access to the required physical data sources, as determined by the SQL criteria and the mappings between source and view models in the VDB. This query plan dictates processing order to ensure physical data sources are accessed in the most efficient manner.
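One such planning decision, pushing each predicate down to the source that owns the column so that every source is asked only for relevant rows, can be sketched as follows (the column-to-source mapping is hypothetical; this is not JDV's planner):

```python
# Hypothetical mapping of columns to the sources that own them.
SOURCE_OF = {"name": "crm", "balance": "billing"}

def push_down(predicates):
    """Group WHERE-clause predicates by the source able to evaluate them."""
    plan = {}
    for column, value in predicates.items():
        plan.setdefault(SOURCE_OF[column], {})[column] = value
    return plan

# Each source receives only the predicates it can evaluate itself,
# minimizing the data pulled back for joining in the engine.
plan = push_down({"name": "Alice", "balance": 100})
print(plan)  # {'crm': {'name': 'Alice'}, 'billing': {'balance': 100}}
```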
The connector architecture consists of translators and resource adapters, which together provide transparent connectivity between the query engine and the physical data sources. A translator converts queries into source-specific commands, and a resource adapter provides the actual communication with the source.
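The translator idea can be sketched as follows: the same logical query is converted into a different native command for each source type. The classes and signatures here are invented for illustration and are not JDV's translator SPI:

```python
class SqlTranslator:
    """Render a logical query as parameterized SQL for a relational source."""

    def translate(self, table, columns, filters):
        sql = f"SELECT {', '.join(columns)} FROM {table}"
        if filters:
            sql += " WHERE " + " AND ".join(f"{c} = ?" for c in filters)
        return sql, list(filters.values())

class MongoTranslator:
    """Render the same logical query as a MongoDB-style find command."""

    def translate(self, table, columns, filters):
        projection = {c: 1 for c in columns}  # project only requested columns
        return "find", table, dict(filters), projection

# One logical query, two source-specific commands.
query = ("customer", ["id", "name"], {"country": "SG"})
print(SqlTranslator().translate(*query))
# ('SELECT id, name FROM customer WHERE country = ?', ['SG'])
print(MongoTranslator().translate(*query))
# ('find', 'customer', {'country': 'SG'}, {'id': 1, 'name': 1})
```

The resource adapter's job, not shown here, is then to ship that native command to the source and return rows to the engine.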
JDV comes with many connectors to connect to various data sources out-of-the-box. Here is an incomplete list:
- Oracle 12c
- Oracle 11g R2
- Oracle 10g R2
- IBM DB2 9.7
- Microsoft SQL Server 2008, 2012
- Sybase ASE 15
- MySQL 5.5
- MySQL 5.1
- PostgreSQL 9.2
- PostgreSQL 8.4
- Teradata 12
- Netezza 6.0.2
- Greenplum Database 4.1
- Ingres 10
- MongoDB v2.4
- Apache Cassandra 2.1
- JBoss Data Grid 6.4
- JBoss Data Services Platform 5.x
- MetaMatrix 5.5.4
- LDAP 3
- SAP NetWeaver Gateway
- ModeShape Hierarchical Database
- XML Files
- XML over HTTP
- SOAP Web Services
- Cloudera CDH 5
- Hortonworks Data Platform 2
- Apache HBase 1.1
- Apache Hive 0.12, 0.13, 0.14
- Google Spreadsheets
- Microsoft Excel 2010
- Microsoft Access 2010
In this article, I described the problems that data virtualization solves and how the enterprise-grade JBoss Data Virtualization product works, both conceptually and architecturally. In the next installment, I shall present an example of using JDV to aggregate information from three different data sources (MySQL, a web service, and an XML file) into a canonical data model, administer security using JDV's data firewall capabilities, and interact with the virtual database using the SQuirreL SQL tool.