Building a custom web analytics tool using Amazon Cloud

January 11, 2019

The digital age has brought about customers who are everywhere at once, using myriad channels as part of their purchase process. Building a consolidated view of these digital interactions in a cost-effective manner is one of the top priorities of senior managers, and rightly so.

The implementation options for providing these advanced insights, however, are limited to a set of highly expensive enterprise tools such as Adobe Analytics (SiteCatalyst), IBM Customer Analytics (Coremetrics), WebTrends, and Google Analytics 360. While these tools do provide features to track cross-channel visitor behavior, the total cost of ownership of such solutions (software, hardware, implementation, consulting fees, etc.) is usually prohibitive for large-scale adoption.

In this article, software consultants from Itransition share their experience in building custom web analytics solutions with big data stack components from the Amazon Cloud. While this approach certainly involves higher capital expenditure in the form of software development effort, we believe that the long-term cost savings and the highly customized nature of the resulting implementation make it a very promising option for generating advanced, cross-channel customer intelligence.

The conceptual architecture

As with any software solution, it helps to break down the problem into conceptual building blocks of functionality that are tool-agnostic and will work with any platform (Amazon Cloud, Azure, VMware, Google App Engine, etc.). For our custom web analytics solution, the conceptual architecture consists of five building blocks:

#1. The pixel server

Tracking user activity using pixels is a standard practice in digital analytics. Web pages (and other tracked resources) typically contain a pixel tag; when a browser loads the parent page, the pixel is also loaded, and the request leaves a trace of the information passed along with that hit. If each such hit can be associated with a unique user id and a date/time stamp, then hits can be aggregated at the visitor level.
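To make the mechanism concrete, here is a minimal sketch of a single pixel hit. The endpoint (pixels.example.com/p.gif) and the parameter names (uid, ts) are purely illustrative assumptions, not something the article prescribes:

```javascript
// Illustrative pixel hit: the endpoint and parameter names are hypothetical.
(function () {
  // Read or create a unique user id stored in a first-party cookie.
  var match = document.cookie.match(/(?:^|;\s*)uid=([^;]+)/);
  var uid = match ? match[1] : Math.random().toString(36).slice(2);
  document.cookie = "uid=" + uid + "; path=/; max-age=" + 60 * 60 * 24 * 365;

  // Requesting the 1x1 pixel leaves a hit record (user id plus timestamp) in the access log.
  var img = new Image(1, 1);
  img.src = "https://pixels.example.com/p.gif" +
            "?uid=" + encodeURIComponent(uid) +
            "&ts=" + Date.now();
})();
```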

The pixels must be physically stored somewhere, such as a server filesystem, a cluster of servers, or a content delivery network. The pixel server component specifies the physical location that pixels are served from. A well-designed pixel server must be able to serve a very large number of pixels with minimal latency, regardless of where the requesting user is located, and without slowing down the parent application.

#2. Data collection engine

Pixel servers are designed to serve static images quickly and typically cannot store large amounts of log data. For this reason, the data about pixel hits needs to be periodically flushed out to a more specialized data collection layer, which we refer to as the data collection engine. The data is still in its raw form (as on the original pixel server) but is much larger in volume than what sits on the pixel servers at any given time.

#3. Transformer

This component performs two functions:

  • Constantly fetches raw logs from the pixel server into the data collection engine.
  • Performs the ETL to create the final user/session-level datasets, which are dumped onto the data storage engine. For this, the transformer implements all the business logic to sessionize data (it defines the duration of a session and pulls together records created within that window) and then rolls it up further into user-level data (combining data from multiple sessions into a single user-level record). A minimal sketch of this step follows the list.
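To make the sessionization and roll-up logic concrete, here is a minimal sketch in plain JavaScript. The field names (uid, ts) and the 30-minute inactivity window are assumptions made for illustration; the article does not prescribe them:

```javascript
// Minimal sessionization sketch: groups hit-level records into sessions per user,
// then rolls sessions up into user-level records. Field names and the 30-minute
// inactivity window are illustrative assumptions.
const SESSION_GAP_MS = 30 * 60 * 1000;

function sessionize(hits) {
  // Sort by user and time so inactivity gaps can be detected per user.
  const sorted = [...hits].sort((a, b) => a.uid.localeCompare(b.uid) || a.ts - b.ts);
  const sessions = [];
  let current = null;

  for (const hit of sorted) {
    const startNew = !current ||
      current.uid !== hit.uid ||
      hit.ts - current.end > SESSION_GAP_MS;

    if (startNew) {
      current = { uid: hit.uid, start: hit.ts, end: hit.ts, hits: 0 };
      sessions.push(current);
    }
    current.end = hit.ts;
    current.hits += 1;
  }
  return sessions;
}

// User-level roll-up: combine each user's sessions into a single record.
function toUserLevel(sessions) {
  const users = new Map();
  for (const s of sessions) {
    const u = users.get(s.uid) || { uid: s.uid, sessions: 0, totalHits: 0 };
    u.sessions += 1;
    u.totalHits += s.hits;
    users.set(s.uid, u);
  }
  return [...users.values()];
}
```

In a real pipeline this logic would run as a batch job over the raw logs rather than in memory; the sketch only illustrates the business rules involved.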

#4. Data storage engine

The pixel servers provide hit-level data, which is periodically moved to the collection engine that is designed to store much bigger datasets. The transformer then converts the raw data into user/session-level datasets that follow a schema defined by business requirements.

The data storage engine provides physical storage for the final, transformed data, which can be plugged into business intelligence engines or analytics applications.
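As a purely hypothetical illustration (the article does not prescribe a schema), a session-level record produced by the transformer and persisted in the data storage engine might look something like this:

```javascript
// Hypothetical session-level record; the actual schema depends on business requirements.
const sessionRecord = {
  userId: "u-8f3a21",
  sessionId: "s-000417",
  startedAt: "2019-01-11T10:32:05Z",
  endedAt: "2019-01-11T10:47:50Z",
  pageViews: 7,
  entryPage: "/pricing",
  referrer: "https://www.google.com/",
  campaign: "spring_sale"
};
```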

#5. Client-side tracker

The client-side tracker is what actually generates the raw hit-level data that is sent to the pixel server. When tracking websites, this is typically a piece of JavaScript code that captures information such as page title, page path, referring URL, and other data relating to marketing campaigns. These trackers can be designed to send almost any information that is available on a webpage.
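A minimal tracker might look like the sketch below. The pixel endpoint, the parameter names, and the use of a utm_campaign query parameter are illustrative assumptions only:

```javascript
// Minimal client-side tracker sketch; endpoint and parameter names are hypothetical.
function trackPageView() {
  var hit = {
    dt: document.title,                        // page title
    dp: location.pathname,                     // page path
    dr: document.referrer,                     // referring URL
    cm: new URLSearchParams(location.search)
          .get("utm_campaign") || ""           // marketing campaign parameter, if present
  };

  var query = Object.keys(hit)
    .map(function (k) { return k + "=" + encodeURIComponent(hit[k]); })
    .join("&");

  // Fire the hit by requesting the tracking pixel with the data in the query string.
  new Image(1, 1).src = "https://pixels.example.com/p.gif?" + query;
}

trackPageView();
```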

Using the Amazon Cloud Platform

Amazon Cloud provides almost plug-and-play tools for implementing each of the conceptual building blocks identified above. Let us see how.

Pixel server

Amazon CloudFront provides a plug-and-play content delivery network to serve as the pixel server. Static pixels can be hosted on Amazon S3, and CloudFront automatically caches them so they are served from the edge location closest to the requesting browser. CloudFront can also be easily configured to store its access logs on Amazon S3, which removes the need to manually migrate raw hit data from the pixel server to the data collection engine.
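As a quick sketch of the setup, the pixel itself is just a tiny static object in S3 that CloudFront fronts and caches. The bucket name, region, and key below are placeholders, and the snippet assumes the AWS SDK for JavaScript (v3) with credentials already configured:

```javascript
// Sketch: upload a transparent 1x1 GIF to S3 so a CloudFront distribution
// pointing at this bucket can serve it as the tracking pixel.
// Bucket, region, and key are placeholders.
const { S3Client, PutObjectCommand } = require("@aws-sdk/client-s3");

const s3 = new S3Client({ region: "us-east-1" });

// Smallest commonly used transparent GIF, base64-encoded.
const pixel = Buffer.from(
  "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7",
  "base64"
);

async function uploadPixel() {
  await s3.send(new PutObjectCommand({
    Bucket: "my-analytics-pixel-bucket",       // placeholder bucket fronted by CloudFront
    Key: "p.gif",
    Body: pixel,
    ContentType: "image/gif",
    CacheControl: "public, max-age=31536000"   // let edge caches keep the pixel
  }));
}

uploadPixel().catch(console.error);
```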

Data collection engine

Amazon S3 is an AWS service that provides near-infinite storage capacity for raw text data. With Amazon S3, developers do not have to worry about running out of disk space for raw logs. The pricing for this service is also extremely attractive, which makes it suitable even when logs reach petabyte scale and come from multiple corporate servers.
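When the transformer later needs to pick up the accumulated CloudFront access logs, it can simply enumerate the log objects under the configured prefix. A minimal sketch follows; the bucket and prefix are placeholders, again assuming the AWS SDK for JavaScript (v3):

```javascript
// Sketch: list the CloudFront access log files that have accumulated in S3.
// Bucket name and prefix are placeholders.
const { S3Client, ListObjectsV2Command } = require("@aws-sdk/client-s3");

const s3 = new S3Client({ region: "us-east-1" });

async function listLogFiles(bucket, prefix) {
  const keys = [];
  let token;
  do {
    const page = await s3.send(new ListObjectsV2Command({
      Bucket: bucket,
      Prefix: prefix,                 // e.g. "cloudfront-logs/"
      ContinuationToken: token
    }));
    (page.Contents || []).forEach(function (obj) { keys.push(obj.Key); });
    token = page.NextContinuationToken;
  } while (token);
  return keys;
}

listLogFiles("my-analytics-log-bucket", "cloudfront-logs/").then(console.log);
```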

Transformer

The Transformer component implements all the code to convert raw hit data into a format that can be consumed for business reporting and analysis. Apache Pig is a natural technology choice for this if the underlying storage engine uses HDFS. Other options include technologies such as Talend for Big Data, Pentaho Kettle, and Informatica, all of which are capable tools for performing complex batch transformations on large datasets.

Data storage engine

This will store both the raw data that will be processed by the Transformer and the final transformer output that will be used by end users. Possible implementation options could be Hive (part of Amazon EMR), Amazon Redshift, Amazon DynamoDB, or even just plain Amazon RDS (running MySQL or some other RDBMS). The choice would depend entirely on how the information needs to be processed.

For example, for largely static reporting needs, companies might consider using Redshift as the data sink. For highly interactive, exploratory data analysis (such as in embedded BI), it might be better to use Amazon RDS. Similarly, when there is a lot of variation in the kind of metadata that is tracked, something like DynamoDB might work well.

Client-side tracker

This really is just plain old JavaScript code and does not depend on the AWS cloud at all. Trackers can send simple attributes such as page name, path, referring URL, and campaign parameters, as well as more advanced information stored in first-party cookies. The trick here is to define a flexible data model first and then use a plugin-style coding paradigm to populate the entities from client-side information, as in the sketch below.
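The sketch below shows one way such a plugin-style tracker could be structured. The plugin set, field names, and pixel endpoint are all illustrative assumptions:

```javascript
// Plugin-style tracker sketch: each plugin contributes fields to the hit.
// Plugins, field names, and the endpoint are illustrative, not a prescribed data model.
var tracker = {
  plugins: [],
  use: function (plugin) { this.plugins.push(plugin); return this; },
  send: function () {
    var data = {};
    this.plugins.forEach(function (plugin) {
      Object.assign(data, plugin());           // each plugin returns a partial record
    });
    var query = Object.keys(data)
      .map(function (k) { return k + "=" + encodeURIComponent(data[k]); })
      .join("&");
    new Image(1, 1).src = "https://pixels.example.com/p.gif?" + query;
  }
};

// Example plugins: basic page data plus a campaign parameter from the URL.
tracker
  .use(function () { return { dt: document.title, dp: location.pathname, dr: document.referrer }; })
  .use(function () { return { cm: new URLSearchParams(location.search).get("utm_campaign") || "" }; })
  .send();
```

New kinds of client-side information can then be added by registering another plugin, without touching the core send logic.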

Summary

The appeal of building a custom solution on the AWS cloud with the components above lies largely in the fact that all of them can be up and running with almost zero capital investment. Immediate access to almost infinite storage and processing capacity and, most importantly, a significantly lower total cost of ownership are just some of the other value propositions that should be objectively considered when making build-vs-buy decisions for advanced digital analytics.
