Thursday, September 29, 2016

CXF JAX-RS 2.0 - Perfect HTTP Spark Streaming Connector

Even the most conservative among us web services developers would do well to admit, sooner rather than later, that Big Data can no longer be ignored: it has become a major technology in the software industry and will only become more influential as the Internet of Things wave rolls in.

Where does that leave the typical HTTP service which GETs data from some store for its users or accepts POSTs with new data?

While I'm somewhat concerned to see Big Data consumers collecting their source data via a variety of custom, optimized protocols and low-level transports such as TCP, I firmly believe HTTP connectors should, and will, play a big role in connecting web users to Big Data processing chains.

HTTP, being so widely used, is a perfect frontend to the local networks where the nodes process the data, and while a typical HTTP interaction is synchronous, JAX-RS 2.0 REST services can be quite smart about it. A variety of typical REST patterns can be employed. For example, a POST handler receiving data to be run through a Big Data chain can hand the work off to an application thread and respond to the user immediately, offering a link with a job id where the status can be monitored or the results retrieved (see the sketch below). Or the handler can rely on suspended HTTP invocations and start streaming the results back as soon as they become available.
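Here is a minimal, self-contained sketch of that first pattern; the resource path, the in-memory job store and the processing step are all made up for illustration:

    import java.net.URI;
    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import javax.ws.rs.Consumes;
    import javax.ws.rs.GET;
    import javax.ws.rs.POST;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;
    import javax.ws.rs.core.Context;
    import javax.ws.rs.core.Response;
    import javax.ws.rs.core.UriInfo;

    // Hypothetical "job id" resource: accept the data, hand it off to a worker
    // thread and reply immediately with a link where the job can be polled.
    @Path("jobs")
    public class JobSubmissionService {
        private final ExecutorService executor = Executors.newCachedThreadPool();
        private final Map<String, String> results = new ConcurrentHashMap<>();

        @POST
        @Consumes("text/plain")
        public Response submit(String data, @Context UriInfo uriInfo) {
            String jobId = UUID.randomUUID().toString();
            executor.submit(() -> results.put(jobId, process(data)));
            URI statusUri = uriInfo.getAbsolutePathBuilder().path(jobId).build();
            return Response.accepted().location(statusUri).build();
        }

        @GET
        @Path("{id}")
        public Response status(@PathParam("id") String id) {
            String result = results.get(id);
            return result == null
                ? Response.status(Response.Status.ACCEPTED).entity("still running").build()
                : Response.ok(result).build();
        }

        private String process(String data) {
            // stand-in for the actual Big Data processing chain
            return data.toUpperCase();
        }
    }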

I have created a Spark Streaming demo showing some of the possible approaches. The demo is a work in progress and I would appreciate feedback from the Spark experts on how it can be improved.

The demo relies entirely on JAX-RS 2.0 AsyncResponse. The typical pattern is to resume it when some response data become available, and, usefully, it can be suspended multiple times, allowing fine-grained control over the way the service code returns the data. StreamingOutput is another piece of the puzzle: it allows the data to be written back to the user as soon as they become available. FYI, CXF ships a typed analog called StreamingResponse; you can see how it is used indirectly in this RxJava Observable test code.
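A minimal sketch of how AsyncResponse and StreamingOutput can be combined; the executor and the fetchChunks() data source are placeholders for this example, not part of the demo:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;
    import javax.ws.rs.container.AsyncResponse;
    import javax.ws.rs.container.Suspended;
    import javax.ws.rs.core.StreamingOutput;

    // Sketch: suspend the invocation, then resume it with a StreamingOutput
    // that writes each piece of data as soon as it is available.
    @Path("stream")
    public class StreamingService {
        private final ExecutorService executor = Executors.newSingleThreadExecutor();

        @GET
        @Produces("text/plain")
        public void get(@Suspended AsyncResponse async) {
            executor.submit(() -> {
                StreamingOutput output = os -> {
                    for (String chunk : fetchChunks()) {
                        os.write((chunk + "\n").getBytes(StandardCharsets.UTF_8));
                        os.flush(); // push each piece to the client right away
                    }
                };
                async.resume(output);
            });
        }

        // stand-in for the data arriving from the processing pipeline
        private List<String> fetchChunks() {
            return Arrays.asList("first", "second", "third");
        }
    }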

But let me get back to the demo. It shows two types of Receivers in action. This demo service shows how an HTTP InputStream can be converted into a List of Strings, with a custom Receiver making them available to Spark. The service currently creates a streaming context per request, which I imagine may not be ideal, but my tests showed the service performing quite well when the input set is parallelized: less than a second for half a megabyte of PDF.
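For illustration only, this is roughly what such a custom receiver can look like (ListReceiver is a made-up name, not the demo's class):

    import java.util.List;

    import org.apache.spark.storage.StorageLevel;
    import org.apache.spark.streaming.receiver.Receiver;

    // Sketch of a custom receiver that feeds a list of lines to Spark; the demo's
    // receiver does the same with the strings read from the HTTP InputStream.
    public class ListReceiver extends Receiver<String> {
        private final List<String> lines;

        public ListReceiver(List<String> lines) {
            super(StorageLevel.MEMORY_ONLY());
            this.lines = lines;
        }

        @Override
        public void onStart() {
            // push the records to Spark on a separate thread, as the Spark docs recommend
            new Thread(() -> {
                for (String line : lines) {
                    store(line);
                }
            }).start();
        }

        @Override
        public void onStop() {
            // nothing to clean up in this sketch
        }
    }

It would then be plugged into a streaming context with something like jssc.receiverStream(new ListReceiver(lines)), one context per request as the demo currently does.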

Speaking of PDFs and other binary files: one of the service methods uses the beautiful Apache Tika Parser API, which is wrapped in this CXF extension. These few lines of code are all it takes to enable a service to push the content of either PDF or OpenOffice documents to the Spark pipeline (I have only added the PDF and OpenOffice Tika Parser dependencies to the demo so far). I'm sure you are now starting to wonder why the Tika API is still not used in your JAX-RS services which parse PDFs only with a PDF-specific API :-)
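If you prefer to call Tika directly rather than through the CXF extension, the core of it looks roughly like this (the file name is just an example):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaExtractDemo {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 removes the default write limit
            Metadata metadata = new Metadata();
            // detect the format (PDF, OpenOffice, ...) and extract the text
            try (InputStream is = Files.newInputStream(Paths.get("report.pdf"))) {
                parser.parse(is, handler, metadata, new ParseContext());
            }
            // the extracted text can now be split into lines and fed to the Spark pipeline
            System.out.println(handler.toString());
        }
    }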

I keep getting distracted. Back to the demo again. This demo service is a bit closer to a real deployment scenario. It uses the default Spark socket receiver: the JAX-RS 2.0 service, being a good HTTP frontend, forwards the HTTP stream data to the internal Spark Streaming TCP server, which processes the data and makes them available to a JAX-RS AsyncResponse handler that also acts as a socket server. The correlation between a given HTTP request and the Spark output data is achieved with a custom protocol extension. I imagine it will be easier with an internal Kafka receiver, which is something the demo will be enhanced with later on.
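The Spark side of such a setup can be sketched with the stock socket text stream; the host, port and the trivial print() pipeline below are assumptions for illustration, not the demo's wiring:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class SocketReceiverSketch {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("socket-demo").setMaster("local[*]");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

            // The default socket receiver reads newline-delimited text from host:port;
            // in the demo it is the JAX-RS frontend that feeds the HTTP payload into that socket.
            JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

            lines.print(); // stand-in for the real processing and the AsyncResponse hand-off

            jssc.start();
            jssc.awaitTermination();
        }
    }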

In both cases, the demo streams the response data pieces back to the user as soon as they become available to the JAX-RS AsyncResponse handler.

Additionally, the demo shows the CXF JAX-RS Oneway extension in action: the HTTP client gets a 202 status back immediately while the service continues processing the request data.
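The extension takes care of this for you; with plain JAX-RS alone, the same observable effect could be approximated along these lines (a sketch, not the CXF extension itself):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import javax.ws.rs.Consumes;
    import javax.ws.rs.POST;
    import javax.ws.rs.Path;
    import javax.ws.rs.core.Response;

    // Roughly what the oneway behaviour looks like from the client's point of view:
    // the data is handed off to a worker and 202 Accepted is returned straight away.
    @Path("oneway")
    public class OnewayLikeService {
        private final ExecutorService executor = Executors.newCachedThreadPool();

        @POST
        @Consumes("text/plain")
        public Response accept(String data) {
            executor.submit(() -> process(data));
            return Response.accepted().build();
        }

        private void process(String data) {
            // stand-in for the actual request data processing
        }
    }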

I'm sure the demo will need more work, but I also hope there's enough material there for you to start experimenting. Please give it a try and watch for updates. I think it will be very interesting to see how this demo could also be written with the Apache Beam API; check this blog entry for a good introduction.

Enjoy!
