Experimenting with Big Data

Want to know about the sexiest job of the 21st century? According to a 2012 article in Harvard Business Review, it's the data scientist: the person who makes sense of the masses of data being collected by organisations.

It's not a job title with a long history: it was only coined in 2008 by Jeff Hammerbacher and DJ Patil. But the notion of someone seeing patterns in data has been around for centuries. When John Snow first homed in on a particular well in Soho as a primary source of cholera, he was doing what the modern data scientist does; the only difference is that the modern data scientist has a much more complex role.

The modern data scientist has to look at data from a variety of sources: SQL databases, text messages and social media outlets, to name but a few. There's even one supermarket that uses store security cameras to monitor customers' behaviour and assess how they perceive certain items. It's all about gaining a commercial advantage.

So what techniques can be used to bring together all the relevant information? According to the McKinsey report Big data: The next frontier for innovation, competition, and productivity, several can be deployed to extract information from this mix of large databases and unstructured data.

One of the most common big data techniques is data fusion. This, as its name implies, involves bringing together data from a variety of different sources. As mentioned earlier, these sources can be very diverse and can include sales information, data about customers from CRM software, production figures and so on - and that's before we get to unstructured data from the likes of social media systems.
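To make the idea concrete, here's a minimal data-fusion sketch in Python using pandas. The file names, columns and join keys are hypothetical; the point is simply that records exported from separate systems are joined on shared identifiers into a single analysable view.

```python
# A minimal data-fusion sketch using pandas. The file names, columns
# and join keys are hypothetical extracts from three separate systems.
import pandas as pd

sales = pd.read_csv("sales.csv")            # customer_id, product_id, amount
customers = pd.read_csv("crm_export.csv")   # customer_id, segment, region
production = pd.read_csv("production.csv")  # product_id, units_made

# Fuse the sources into a single view keyed on shared identifiers
fused = (sales
         .merge(customers, on="customer_id", how="left")
         .merge(production, on="product_id", how="left"))

# The combined view can then be analysed as one dataset,
# e.g. revenue per customer region
print(fused.groupby("region")["amount"].sum())
```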

Another widely used big data technique is crowdsourcing - an attempt to break down the task of sorting through large amounts of data by calling on users to compile and analyse the figures.

There are many other commonly used techniques. Time series analysis allows data scientists to plot a series of data points over a specified period, with measurements taken at uniform intervals - at the same time every day, for example. Cluster analysis sorts data into groups, or clusters, whose members are similar to one another, making analysis easier. A/B testing compares different objects against a specific control group, helping to determine whether a particular set of actions will lead to a given objective. And ensemble learning, also used to make accurate predictions, works by considering the interactions between the separate constituents of a model: by looking at performance as a whole, users can get a more accurate prediction than by considering each constituent separately.
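As an illustration of one of these techniques, here's a minimal cluster-analysis sketch using k-means from scikit-learn. The "customers" are randomly generated and the features are invented; it simply shows how similar records get sorted into groups for easier analysis.

```python
# A minimal cluster-analysis sketch using k-means from scikit-learn.
# The "customers" are randomly generated and the features invented.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Toy data: 300 customers described by annual spend and monthly visits,
# drawn loosely from three underlying groups
centres = np.array([[200, 2], [800, 6], [1500, 12]])
customers = np.vstack([rng.normal(c, [50, 1], size=(100, 2)) for c in centres])

# Sort the customers into three clusters of similar behaviour
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)

# Each cluster centre summarises a segment; labels_ gives each
# customer's assigned group
print(model.cluster_centers_)
print(model.labels_[:10])
```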

But the most radical and, in the long term, most important technique is machine learning: a way of using artificial intelligence to automatically learn to recognise complex patterns in data and make intelligent decisions based on them. We can see its use now within financial organisations, where a range of algorithms conducts multi-million dollar trades with minimal human intervention, but according to Dave Coplin, Chief Envisioning Officer at Microsoft UK, machine learning will in time start penetrating enterprises at every level. "It won't be long before we see machine learning in normal enterprises, not just in scientific organisations and academia," he says. "Within five years, I expect to see widespread use."
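Neither Coplin nor the trading firms publish their algorithms, but the core loop of machine learning is easy to sketch: a model is shown labelled examples, learns the pattern, and then makes decisions on data it hasn't seen. The example below is purely illustrative, using scikit-learn on synthetic data.

```python
# An illustrative machine-learning sketch with scikit-learn: the model
# learns a pattern from labelled examples, then decides on unseen data.
# The "transactions" are synthetic; this is not any firm's real system.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Two invented features per transaction, e.g. size and deviation
# from the account's typical behaviour
X = rng.normal(size=(1000, 2))
# Synthetic label: flag combinations that look unusual
y = (X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 1.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Learn the pattern from the training examples...
model = LogisticRegression().fit(X_train, y_train)
# ...and check how well the learned rule generalises to unseen data
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")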

He gave three examples of where machine learning is already having an impact on everyday life: "The first example is Xbox games. Take something like Halo 4: a new player would join a multiplayer match and, if he wasn't very experienced, would get killed instantly. It's not a good user experience. But it took about 50 games for him to be matched better. By using machine learning and matching patterns, we've got that down to five games," says Coplin.

Coplin's other examples are language translation and the Xbox Kinect, where machine learning has radically improved accuracy. "It used to take a team of developers 200 hours of coding to interpret each movement. By using machine learning, we've reduced that to two hours," he adds.

This range of analytical techniques needs to be based on a robust platform for precise analysis. Anyone who explores big data for any length of time will be drawn towards Hadoop, the dominant standard for big data implementations. Microsoft has been an enthusiastic supporter of Hadoop, working with Hadoop vendor Hortonworks to roll out big data applications to the wider market. It's been a necessary exercise, as Hadoop had a reputation for being a slightly esoteric technology: suitable for highly technical specialists but not relevant to the wider world of business.
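Hadoop's core programming model is MapReduce, and the classic illustration is a word count. The sketch below is written for Hadoop Streaming, which lets the mapper and reducer be ordinary scripts reading stdin and writing stdout; the exact job submission details will vary with your installation.

```python
# mapper.py - a Hadoop Streaming mapper. Hadoop pipes a split of the
# input to stdin; for every word we emit "word<TAB>1" on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")
```

Hadoop then sorts and shuffles the mapper output by key, so the reducer sees all the counts for a given word together:

```python
# reducer.py - because Hadoop sorts the mapper output by key, all
# counts for a word arrive together and can be summed in one pass.
import sys

current, count = None, 0
for line in sys.stdin:
    word, _, n = line.rstrip("\n").partition("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```

Both scripts are submitted to the cluster via the hadoop-streaming JAR, with Hadoop handling the distribution of work across nodes - exactly the plumbing whose perceived complexity the Microsoft tooling described below aims to smooth over.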

Microsoft is looking to crack this issue with a couple of products. HDInsight Server is software that combines Hadoop technology with Windows; the idea is that it gives organisations the opportunity to use familiar software for big data implementations. Companies can construct new business applications using Hadoop but run them on their tried and trusted Windows platform. The HDInsight Service offers the same mix of Microsoft tools and Hadoop expertise, but in the cloud, making it the version that can be deployed with Windows Azure implementations.

The software fits neatly into Microsoft's vision of moving away from product silos towards a model where the same tools can be deployed across different Microsoft products - PowerView, for example, is used by Excel, SharePoint and SQL Server - so HDInsight Server can provide insights using tools such as these.

Perhaps the easiest way to see big data techniques in action is to look at the way they're being deployed by Microsoft customers.

One of the ways in which business intelligence can provide a better service for customers is motor insurance. Aviva is using BI to offer more personalised policies, tailored not just by the usual factors of age, gender and driving history, but by actual performance. "Instead, rates would be determined by how you conduct yourself in the car," said Steve Whitby, Solutions Delivery Center Director at Aviva.

To do this, the company developed a mobile phone app that any driver can use for free. Once downloaded, the app uses the phone's built-in GPS technology and sensors to collect data on braking, cornering and acceleration behaviour and send it to a cloud-based system. After 200 miles of driving, Aviva has enough information to give the driver a quote based on the way he or she has been driving. The new service, Aviva Drive, was officially launched in November last year.
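Aviva hasn't published how its scoring works, so the following is purely a toy sketch of the general telematics idea: flag harsh braking and cornering events in the sensor stream, then turn the event rate into a score. The thresholds, field names and scoring formula are all invented.

```python
# A toy sketch of telematics scoring - NOT Aviva's actual algorithm.
# Assume the app has uploaded accelerometer samples (in m/s^2); the
# thresholds, field names and scoring formula are all invented.
HARSH_BRAKE = -3.0    # strong deceleration along the direction of travel
HARSH_CORNER = 3.0    # strong lateral (cornering) acceleration

def driving_score(samples, miles_driven):
    """Crude score out of 100: fewer harsh events per mile is better."""
    events = sum(
        1 for s in samples
        if s["longitudinal"] < HARSH_BRAKE or abs(s["lateral"]) > HARSH_CORNER
    )
    events_per_mile = events / max(miles_driven, 1)
    return max(0.0, 100.0 - 400.0 * events_per_mile)

# Example: 200 miles of mostly smooth driving with three harsh brakes
samples = ([{"longitudinal": -1.2, "lateral": 0.4}] * 500
           + [{"longitudinal": -4.5, "lateral": 0.1}] * 3)
print(f"score: {driving_score(samples, miles_driven=200):.0f}")  # score: 94
```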

But it goes further than that. The app also connects to social media sites so drivers can share scores with other motorists, encouraging them to be better drivers but also promoting the service to non-Aviva customers.

One of the other sectors that can make extensive use of big data techniques is the health service. Not only do clinical practitioners and health administrators need speedy analysis of data: some of the information being gathered needs to be processed urgently, as it can literally be a matter of life and death. The issue is that data is collected from a variety of sources - GPs, clinics, hospitals - which may use a variety of different systems, and the data isn't always collected in real time.

But there's a more pressing problem: not all data is in simple structured form. It could be in a clinician's notes, or it could come from non-health sources such as school attendance records or social worker reports.

Bolton-based BI consultancy Ascribe decided to meet the challenge of handling diverse records, working on a project to establish a standardised approach to healthcare data. Leeds Teaching Hospitals, which generated half a million records every year in its A&E department alone (not to mention one million unstructured files every month), participated in the project.

The plan was to create a way of monitoring infectious disease, both locally and nationally, and to give data analysts and clinicians the means to improve healthcare.

Ascribe's approach was to use a combination of Microsoft SQL Server and a Windows Azure-based hybrid cloud; for data analysis it used Microsoft's HDInsight Service. After amassing a repository of clinical data, Ascribe used the HDInsight Service to create a platform that could handle large amounts of structured and unstructured data, integrating with both Microsoft tools and unstructured sources such as social media. To interpret the unstructured material, it is using natural language processing techniques.
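Ascribe hasn't detailed its pipeline, but a minimal sketch gives the flavour of the natural language step: pulling candidate symptom mentions out of free-text notes so they can be counted and trended. The keyword list and the notes themselves are invented.

```python
# A toy sketch of surfacing structure from free-text clinical notes -
# not Ascribe's actual pipeline. Keywords and notes are invented.
import re
from collections import Counter

SYMPTOMS = ["vomiting", "diarrhoea", "fever", "rash"]
pattern = re.compile(r"\b(" + "|".join(SYMPTOMS) + r")\b", re.IGNORECASE)

notes = [
    "Pt presented with fever and vomiting, onset 24h ago.",
    "Diarrhoea reported; no fever. Advised fluids.",
    "Fever, rash on arms. Possible viral infection.",
]

# Count symptom mentions across notes; a spike in one locality over a
# short window is the kind of signal an outbreak monitor looks for
counts = Counter(m.group(1).lower()
                 for note in notes
                 for m in pattern.finditer(note))
print(counts)  # e.g. Counter({'fever': 3, 'vomiting': 1, ...})
```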

With this system, Ascribe could use the data to identify potential outbreaks of infectious disease and even pick up on trends such as alcohol-related incidents and domestic accidents.

Aviva and Ascribe are just two of the many companies using big data techniques to identify issues and predict the future. In one case it could save lives; in the other it will save money.

What both cases have in common is that they rely on compiling enough information for patterns to be identified - patterns that may not have been recognisable ten, or even five, years ago. In the future, says Dave Coplin, every company will need its own data scientist to make sense of that data. He sees parallels with the way Microsoft gave Excel users the ability to create macros.

Coplin says that companies need to change their mindset and consider data from outside their organisation too, and that departments within companies should be less bureaucratic and less siloed. "You have to look beyond your department and beyond your own organisation," he says. There's plenty of information out there; it just needs to be gathered. But, he warns, this is only half the story: "It's not the data that's important; it's what you do with it." It's a lesson that many companies are beginning to learn.


Max Cooter

Max Cooter is a freelance journalist who has been writing about the tech sector for almost forty years.

At ITPro, Max’s work has primarily focused on cloud computing, storage, and migration. He has also contributed software reviews and interviews with CIOs from a range of companies.

He edited IDG’s Techworld for several years and was the founder-editor of CloudPro, which launched in 2011 to become the UK’s leading publication focused entirely on cloud computing news.

Max attained a BA in philosophy and mathematics at the University of Bradford, combining humanities with a firm understanding of the STEM world in a manner that has served him well throughout his career.