Saturday, July 05, 2014

Xebia Blog: Create the smallest possible Docker container

Yesterday, I published a new post on my employer's weblog, called Create the smallest possible Docker container:

http://blog.xebia.com/2014/07/04/create-the-smallest-possible-docker-container/

TL;DR Creating the smallest possible Docker container is simple. There is a standard base image available called scratch. However, if you want to do something with the scratch container, you need to provide it with an executable that has no dependencies. This blog post explains how to create such an executable using GoLang. It also describes how to control Docker from within Docker and how to create a Docker container that creates a Docker container.

Sunday, July 10, 2011

OrientDB: Connecting, writing and reading data

In my last post, I made a quick comparison of the performance of OrientDB, MongoDB and CouchDB. Since then, I received some requests for source code. Readers seemed most interested in OrientDB. This might be explained by the fact that there is already a lot of information available on MongoDB. The attention MongoDB receives is understandable; it is a great and efficient product.

If you like MongoDB, you might also like OrientDB. It shares many of MongoDB's characteristics and adds extra features, like a friendly REST interface and graph database functionality. As a first step in giving OrientDB some extra attention, here is the source code for the last post. It is a bit crude, but it gives a good impression of the simplicity of working with OrientDB in its most basic form.

This code reads a semicolon-separated text file (CSV), breaks it up without taking care of escaped characters (semicolons within the text fields) and stores all fields in OrientDB. It starts by reading the first line of the CSV, which serves as the field names. It translates the field names from PascalCasing to camelCasing and removes a vendor-specific prefix. This step might need some changes for your own CSV file; I cannot provide my test data for legal reasons.

After that, it reads every line from the file and stores all the values under the field names from the first row. If a line contains an extra semicolon, there will be errors in the data. In a more realistic example, such errors could be detected by comparing the number of values on a line with the number of field names from the first line; a minimal sketch of such a check follows below. The code in this example should be used with a clean CSV file to prevent these errors.
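
Such a check is not part of the original listing; the following helper is only a sketch of the idea, written so it could be dropped into the WriteOrient class below and called for every data line before saving it:

// Sketch only: returns true when a data line has as many values as the header.
static boolean isWellFormed(String line, String[] columns) {
    // split with limit -1 keeps trailing empty fields, so the counts stay comparable
    return line.split(";", -1).length == columns.length;
}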

package eu.adriaandejonge.orient;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;

public class WriteOrient {

    public static void main(String[] args) {
        try {
            long start = System.currentTimeMillis();

            File file = new File("E:/Development/uitjes.csv");

            FileReader reader = new FileReader(file);
            BufferedReader bufferedReader = new BufferedReader(reader);

            // The first line of the CSV contains the field names.
            String firstLine = bufferedReader.readLine();
            String[] columns = firstLine.split(";");
            int length = columns.length;
            for (int i = 0; i < length; i++) {
                // Remove the vendor-specific prefix and translate
                // PascalCasing to camelCasing.
                columns[i] = columns[i].replaceAll("w3s_", "");
                columns[i] = columns[i].substring(0, 1).toLowerCase() +
                        columns[i].substring(1);
            }

            // OrientDB: create a new local database.
            ODatabaseDocumentTx db =
                    new ODatabaseDocumentTx("local:/tmp/demo").create();

            int cnt = 0;
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                cnt++;
                if (cnt % 100 == 0) System.out.println("cnt=" + cnt);

                // OrientDB: create a document of class "uitje" and
                // store every value under its field name.
                ODocument uitje = new ODocument(db, "uitje");
                String[] values = line.split(";");
                for (int i = 0; i < length; i++) {
                    if (i < values.length)
                        uitje.field(columns[i], values[i]);
                }
                uitje.save();
            }

            // OrientDB: close the database connection.
            db.close();
            bufferedReader.close();

            System.out.println("DONE in " +
                    (System.currentTimeMillis() - start) + "ms");

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}


To estimate the amount of code needed to communicate with OrientDB, focus on the lines that touch the OrientDB API (marked with comments in the listing above): opening the database, creating ODocument instances, setting fields and saving them. The rest of the code only serves to read the CSV file. The OrientDB-specific code is comparable to code for similar NoSQL databases, like MongoDB and CouchDB. Even though there is no standardized API for these databases yet, you do not have to worry too much about lock-in. As long as you isolate the database-specific code, you can easily migrate to a different datastore with the same characteristics. OrientDB, MongoDB and CouchDB can all be characterized as NoSQL document stores that are particularly well suited for storing JSON documents with nested key-value pairs.
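
For illustration only, isolating the database-specific code can be as simple as hiding it behind a small interface. The DocumentStore and OrientDocumentStore names below are made up for this sketch; only the ODatabaseDocumentTx and ODocument calls come from the OrientDB API:

import java.util.Map;

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;

// Made-up abstraction: the rest of the application only sees this interface.
interface DocumentStore {
    void save(String className, Map<String, String> fields);
    void close();
}

// OrientDB-specific implementation; a MongoDB or CouchDB version could replace it.
class OrientDocumentStore implements DocumentStore {

    private final ODatabaseDocumentTx db;

    OrientDocumentStore(String url) {
        this.db = new ODatabaseDocumentTx(url).create();
    }

    public void save(String className, Map<String, String> fields) {
        ODocument doc = new ODocument(db, className);
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            doc.field(entry.getKey(), entry.getValue());
        }
        doc.save();
    }

    public void close() {
        db.close();
    }
}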

Reading data from OrientDB is very similar. You can do a lot more than the code example demonstrates: querying and reading specific fields, to name the most basic examples (a small query sketch follows after the listing). What the code does demonstrate is that if you simply want to serve JSON documents to the outside world for client-side processing, you don't need to write a lot of code.

package eu.adriaandejonge.orient;

import java.io.FileWriter;
import java.io.IOException;

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;

public class ReadOrient {

    public static void main(String[] args) {
        try {
            tryOrient();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static void tryOrient() throws IOException {
        long startTime = System.currentTimeMillis();

        // OrientDB: open the existing local database with the default credentials.
        ODatabaseDocumentTx db =
                new ODatabaseDocumentTx("local:/tmp/demo")
                        .open("admin", "admin");
        readCollection(db);
        db.close();

        System.out.println("DONE in " +
                (System.currentTimeMillis() - startTime) + "ms");
    }

    private static void readCollection(ODatabaseDocumentTx db)
            throws IOException {
        int count = 0;
        FileWriter fileWriter = new FileWriter("E:/orient.txt");

        // OrientDB: iterate over all documents of class "uitje"
        // and write each one to file as JSON.
        for (ODocument doc : db.browseClass("uitje")) {
            count++;
            fileWriter.write(doc.toJSON() + "\n");
        }
        fileWriter.close();
        System.out.println("# " + count);
    }
}

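As a sketch of the querying mentioned earlier (not part of the original post): the OrientDB client also understands an SQL-like query language through OSQLSynchQuery. The "city" field below is a made-up example field; everything else follows the same pattern as the listings above.

package eu.adriaandejonge.orient;

import java.util.List;

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;
import com.orientechnologies.orient.core.sql.query.OSQLSynchQuery;

public class QueryOrient {

    public static void main(String[] args) {
        ODatabaseDocumentTx db =
                new ODatabaseDocumentTx("local:/tmp/demo").open("admin", "admin");

        // Query documents of class "uitje" with an SQL-like statement.
        // "city" is a hypothetical field used for illustration only.
        List<ODocument> result = db.query(
                new OSQLSynchQuery<ODocument>(
                        "select from uitje where city = 'Amsterdam'"));

        for (ODocument doc : result) {
            System.out.println(doc.toJSON());
        }
        db.close();
    }
}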

To test these examples, you need a local instance of OrientDB with a default setup. You also need to put the JAR files orientdb-core.jar and orient-commons.jar on your classpath. When connecting to a remote server, you need two additional JARs, orientdb-client.jar and orientdb-enterprise.jar. More details on the libraries required to connect can be found in the OrientDB documentation.

This is just a small first step towards an actual application. Let me know what you think of it. Suggestions for improvement and follow-up posts are welcome.

Thursday, May 19, 2011

MongoDB, CouchDB and OrientDB - a quick comparison

There are many NoSQL databases, each with their own specific purpose and characteristics. This week, my colleagues and I have been looking for a database that helps store a set of simple JSON documents without a lot of hassle and overhead.

Although Cassandra and Hadoop are fascinating products with lots of potential, they seemed to be designed to solve different problems. The long list quickly narrowed down to MongoDB, CouchDB and OrientDB. This was more due to time constraints than a lack of choice. So that leaves many products in the NoSQL world for follow-up posts!

The quick comparison consisted of three questions:
  1. How easy is it to set up the server and connect to it using a Java API?
  2. How fast can you store 9300 records in JSON format?
  3. How fast can you read 9300 records in JSON format?

Please note that this is not a representative test. It is just a quick scan of first results, not hindered by any background knowledge and without the optimizations an expert would apply. This means that the end result only says something about the out-of-the-box quality, not about the ultimate limitations of the products. It must also be mentioned that all tests were performed on 32-bit Windows XP. Perhaps this is not the most suitable environment for NoSQL.


These are our findings:

OrientDB

OrientDB is relatively simple to install. The client API consists of two JARs, less than 850KB in total. The only hurdle in setting up a connection is determining what the connection string should be. It turns out that you need to explicitly create a database before you can connect to it, which is not unreasonable of course.

Writing 9300 records: ± 2,640ms (2nd place in this post)
Reading 9300 records: ± 5,157ms (2nd place in this post)

CouchDB

Installing CouchDB is painful from the start. After downloading, you need to build the server yourself, or you can download an "unstable" installation package (just quoting the website). Setting up the client required nine JARs, 1.5MB in total. The CouchDB website suggests using HttpClient 4.0, which results in a ClassNotFoundException. After downgrading to HttpClient 3.1, the client starts working.

Writing 9300 records: ± 45,340ms (3rd place in this post)
Reading 9300 records: ± 196,985ms (3rd place in this post)

MongoDB

MongoDB is one of the most pleasant servers I have ever installed. The only hurdle during startup is that you need to create a /data/db directory on your hard drive. After that, everything is automated for you. The daemon starts up in the blink of an eye. Databases and collections are automatically created when necessary. The client library is a single JAR of 240KB and provides a clean and simple API; a small sketch follows below.

Writing 9300 records: ± 1,406ms (1st place in this post)
Reading 9300 records: ± 2,140ms (1st place in this post)
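
To give an impression of that API, here is a minimal sketch, not taken from the original post, using the classic Mongo/DBCollection classes of the Java driver that was current at the time; the database, collection and field names are just examples:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.Mongo;

public class WriteMongo {

    public static void main(String[] args) throws Exception {
        // Connect to a local mongod; databases and collections are
        // created on the fly the first time they are used.
        Mongo mongo = new Mongo("localhost", 27017);
        DB db = mongo.getDB("demo");
        DBCollection collection = db.getCollection("uitje");

        // Store one document built from key-value pairs.
        DBObject doc = new BasicDBObject();
        doc.put("name", "example");
        doc.put("city", "Amsterdam");
        collection.insert(doc);

        // Read everything back.
        for (DBObject found : collection.find()) {
            System.out.println(found);
        }
        mongo.close();
    }
}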

Conclusion

In terms of ease of use and response times in the out-of-the-box, unconfigured, unoptimized installation, MongoDB clearly wins. And CouchDB loses in all possible ways.

OrientDB should not be disregarded too quickly, though. It offers more interfaces and functionality out of the box than MongoDB. For example, for a user-friendly REST interface, MongoDB relies on third-party add-ons while OrientDB offers a nice one out of the box. Another fundamental difference is that OrientDB also offers graph database functionality in addition to document storage. Depending on the requirements, that may justify accepting the performance penalty, which is significant but not dramatic.

If you're looking for absolute efficiency in terms of both performance and ease of use, go with MongoDB!

And there is still a lot more to find out about these products beyond the simple tests in this post. Keep an eye open for follow-up posts!

Tuesday, December 07, 2010

Google Search Appliance for Structured Data

Today was the go-live of ANWB's new Car Portal. We call the main application "Search & Compare". Even if you don't speak Dutch, you should be able to try it: go to http://www.anwb.nl/auto and click on the banner in the upper left of the screen (or go directly to this link).

The technology for this is based on the Google Search Appliance (GSA). It contains approximately 200,000 cars with many detailed metadata fields to search on. Most people only consider the GSA for unstructured text search. The new Search & Compare application proves that GSA also works well for structured queries. For more documentation on how to do this, see this link on Google's site.

One of the advantages of the GSA is easy maintenance and administration. You don't have to think too much about the complicated configuration that is specific to indexing and search solutions (as you do for Lucene). All you have to do is create an XML content feed containing the meta fields, as described in Google's documentation.

The last step is to create your own front end. I have to admit: that does require a lot of work. With just an XSLT running inside the GSA, you won't get the results we get in our Search & Compare application. What you can do is ask the GSA for XML output by adding &output=xml or &output=xml_no_dtd to the GET request. Then you can use XSLT in your own front end to create your own screens!
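
As a sketch of that last step (not part of the original post), the snippet below requests XML results from a GSA over HTTP. The host name is a placeholder, and the site and client parameters refer to whatever collection and front end are configured on your appliance:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class GsaXmlClient {

    public static void main(String[] args) throws Exception {
        // Placeholder host, collection and front end; replace with your own GSA settings.
        String query = URLEncoder.encode("volkswagen golf", "UTF-8");
        URL url = new URL("http://gsa.example.com/search?q=" + query +
                "&site=default_collection&client=default_frontend&output=xml_no_dtd");

        // Print the raw XML response; feed it into your own XSLT or XML parser instead.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}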

Saturday, July 17, 2010

Answering mail #12

States Mphinyane wrote:

I was recently on the IBM site and you were listed as one of the experts on XML, and that seems to be an area that I have serious problems with.

The Problem

I am an intern at the Botswana Harvard Partnership, which is a research institute in matters relating to HIV/AIDS. Currently we are running a study in one village just outside the capital city. In this study, we use GPS devices so that we know which households we have been to. Our job is to pull the waypoints off these GPS devices, and the data comes out in the form of an XML file. We do so by using software called GPSBabel, which is installed on Ubuntu. We combine all these XML files into one XML file to show us the coverage of the study so far, because this software has Google Maps functionality. Now we do not want to keep appending new XML code each time the GPS devices come back from the field, because it can get very tricky and confusing working with thousands of lines of code. We want to edit the XML file for each device so that the researchers can return directly to the households they had been to previously. We want to be able to trace our way back to each household. I don't even know if I'm making any sense, but that's the best way I can explain it. If you can manage, you can give me a call at 76XXXXXX, but of course you would need the country code +267. Thank you in advance.

Hello States,

I am not entirely sure if I understand the question well. This is what I got from the mail:

1. You have multiple people, each with his/her own GPS device
2. The GPS device collects data where you have been
3. The GPS device allows you to revisit places you have been earlier
4. Each person only revisits places he/she has been before, not places visited by a colleague
5. GPS devices can export data in a format readable by GPSBabel (not sure which format and whether it is XML or not)
6. GPSBabel outputs KML (most probably), readable by Google Maps
7. The GPSBabel translations are added to one large KML file by hand
8. The KML file is growing so large that editing by hand is no longer desirable

Am I right so far? (I have the most doubts at point 4.)

Now, if I understand correctly, you are looking for a way to simply add individual, smaller XML (probably KML) files to a data repository and have the large all-in-one KML file generated automatically. Correct?

In this case, I would recommend using eXist-db, an open source, native XML database available at http://exist.sourceforge.net/

Using this product, you could add the XML files using a WebDAV connection and query for the large overview file using XQuery.
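
To make that a bit more concrete, here is a sketch (mine, not part of the original mail exchange) that stores one KML file in eXist-db over plain HTTP PUT, which works for both the WebDAV and the REST interface. The port, collection path and file name are examples based on a default local installation, and depending on your security settings you may need to add authentication:

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class StoreKmlInExist {

    public static void main(String[] args) throws Exception {
        // Example URL for a default local eXist-db; collection and document names are made up.
        URL url = new URL("http://localhost:8080/exist/rest/db/gps/device1.kml");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setDoOutput(true);
        connection.setRequestMethod("PUT");
        connection.setRequestProperty("Content-Type", "application/xml");

        // Copy the local KML file into the request body.
        InputStream in = new FileInputStream("device1.kml");
        OutputStream out = connection.getOutputStream();
        byte[] buffer = new byte[4096];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        out.close();
        in.close();

        System.out.println("Stored document, HTTP status: " + connection.getResponseCode());
    }
}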

One thing I am not sure about: do you also need to transfer the stored KML files back to the GPS devices? Or is that unnecessary because of point 4?

Let me know if I have understood correctly and whether you need more pointers on using eXist. If you think it could help, please let me know more about the GPS devices and the data format they export.

Kind regards,

Adriaan.

Tuesday, July 13, 2010

IBM developerWorks: "Query XML documents outside an XML database"

IBM published my latest article today:

www.ibm.com
Processing XML in Java usually requires a lot of code and overhead. If you use XQuery, you can do a lot more with a lot less code, even when the XML is stored outside of XML databases. Learn how to use XQuery with Java technology by extracting the hidden information from XML-based Maven POM files.