Incredible Free Resource: An Introduction to Statistical Learning

I recently came across a wonderful suite of materials for introducing statistical learning:

  • Hastie et al.’s free textbook (a link to the PDF can be found on this page).
  • The accompanying lecture videos – 15 hrs in total – freely available through YouTube (outline of, and links to, the videos here).
  • Additional slides provided by Professor Al Sharif (here), including PDF documents of R scripts and explanations for a wide range of topics covered in the book.

To give folks a feel for the content: it covers many of the techniques taught in the University of Queensland’s graduate-level Machine Learning course, as well as many of the techniques my colleagues and I used at Shell to help optimise its massive coal-seam gas business in Brisbane, Australia.

Make Analytics Boring Again: Problem Definition

In the aftermath of the global financial crisis (GFC), some experts proclaimed that a key fix was to ‘make banking boring again’ – that, essentially, money and drugs and Ferraris had motivated the risk-taking that led to the GFC.

Relatedly, I tend to believe that a lot of boring engineering work actually generates a lot of value. But it isn’t sexy, so it often gets ignored by relative newcomers. Eventually those newcomers get bitten by the risks of skipping the unsexy bits. Then they, too, will write relatively obscure blog posts about them.

Until then, I’d like to highlight a key issue for helping ensure data analytics projects are set up for success: clear definition of the problem to be solved.

Note that it is ‘problem definition’, not ‘tool definition’. The problem definition does not dictate whether machine learning will be used; it just requires that the problem be stated clearly enough for engineers to determine effective options for solving it.

Social Networks and the Radicalisation of Some – Exhibit B

This example shows how some people come to believe the Earth is flat thanks to YouTube’s video recommendation algorithms.

Last week I came across another great example that highlights how some are getting radicalised on social networks. This time the example is from YouTube and the radicals are flat Earthers.

BBC has the reporting – here’s the video.

OMSCS at Georgia Tech for Technical Development of Data Scientists

I’ve looked at and used a range of training resources for software engineers and machine learners. For some, the quality can be poor and the value dubious.

But one resource stands above the rest – world class, with very high value: the OMSCS at Georgia Tech.

If you don’t already know, OMSCS stands for the Online Master of Science in Computer Science. Georgia Tech has partnered with Udacity to deliver courses online for those who want to study computer science at the post-graduate level. Complete 10 courses and they will give you an MS in CS. But keep in mind it is one of the top computer science departments in the US, and it is not easy.

Their mission is to make education accessible to more people, and they do so by charging fees at cost. The entire degree costs ~US$7k.

The program is part-time – the most aggressive schedule is 5 courses in a 12-month period – so the degree can be completed in as little as ~2 years.

If an engineer has some background in basic object-oriented programming, networking and relational databases, plus some prior exposure to memory management and Python, then it should be possible to get through the coursework. (Check out the OMSCS subreddit for admission stats.)

At one course per semester, it would be a ~3-year effort – and a significant technical development path for any relatively junior machine learner.

Not too bad for technical development: US$7k and ~3 years, part-time!

Machine Learning for Managers – What to Know

I’ve been a part of several teams now who have considered, or are actively considering, how they might incorporate machine learning (ML) into their business as usual. In these cases I’ve seen some predictable, and avoidable, misunderstandings.

Misunderstanding #1: ML alone will precipitate gold from voluminous corporate databases.

It can be tempting for decision makers to think of ways to leverage their resources for value. Some, quite innocently, look at corporate databases and conclude ‘there must be value in there’, and assume ML will draw that value out.

This error can be quite frustrating and costly if it lives for too long among influential decision makers.

A more productive process for creating value – at least as far as ML is concerned – is to examine where the team is spending the most time and effort. Are any of those activities repetitive? Do any require an experienced person to gather a lot of data to make a determination? If you answered ‘yes’ to either question, then it is possible – though not certain! – that ML could help you.

Examples of where ML has been quite useful:

  1. Automating the diagnosis of mechanical failures in oil wells given static and time-series data.
  2. Automating MRI interpretation for radiologists.
  3. Determining if a customer is likely to buy a product based on their demographic information (as they walk into the store!).
  4. Using a person’s alcohol purchasing history to determine their voting behaviour in US presidential elections.
  5. Automatically interpreting oil well drilling information to raise alarms to drilling engineers.

Misunderstanding #2: ML has no overhead above and beyond the final ML analysis.

I’ve heard a famous machine learning professor – who consults regularly – say there’s no point exploring ML unless a data warehouse is already available. If you don’t know whether your company has a data warehouse, it probably doesn’t.

A data warehouse is a database that pulls in information from many different sources and can be accessed using regular software tools across the organisation. The warehouse may live in Amazon AWS or Google Cloud Storage, or could be a solution like the PI System from OSIsoft (common in industrial plants, factories and oil/gas assets).

A data warehouse provides a foundation where data importing and cleansing can be automated. Once established, a data warehouse is the natural place to deploy a machine learning model (i.e. where you can generate newly calculated values in your database) and make those values available to the organisation through established data warehouse tools.
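
To make the deployment idea concrete, here is a minimal sketch of scoring rows in a warehouse table and writing the predictions back. It assumes a SQL-accessible warehouse and a previously trained model; the connection string, table and column names are hypothetical placeholders – the pattern, not the specifics, is the point.

    import joblib
    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical warehouse connection -- swap in your own.
    engine = create_engine("postgresql://user:pass@warehouse-host/analytics")

    # Pull the features the model expects from an existing warehouse table.
    features = pd.read_sql(
        "SELECT well_id, pressure, flow_rate FROM well_readings", engine
    )

    # Load a previously trained model and score every row.
    model = joblib.load("failure_classifier.joblib")
    features["predicted_failure_mode"] = model.predict(
        features[["pressure", "flow_rate"]]
    )

    # Write the newly calculated values back so established warehouse
    # tools can surface them across the organisation.
    features.to_sql(
        "well_failure_predictions", engine, if_exists="replace", index=False
    )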

Misunderstanding #3: ML is supposed to be quick and cutting-edge. It doesn’t need laborious, manual work.

Some ML requires a reliable ground-truth dataset, called training data. Such ML techniques are called ‘supervised learning’ techniques, and the resulting models are only as good as the training data they learn from. So investing in manually establishing very high-quality training data can be valuable.

Take ML algorithms that automate the interpretation of MRI scans, for example. It is easy to see why errors should be avoided – so training the algorithm on many images with known diagnoses would help ensure the resulting ML model is robust. But that means time and cost are required to generate the input training data before a suitable ML model can be built.
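
As a minimal sketch of that supervised-learning workflow – using synthetic stand-in data rather than real MRI scans – the model is fit on labelled examples and then judged against labels it has never seen:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for a manually labelled ground-truth dataset.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

    # Hold out some labelled examples to measure quality honestly.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # The model is only as good as the labels it learned from.
    print(accuracy_score(y_test, model.predict(X_test)))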

Misunderstanding #4: ML should be used to do cutting-edge work in our non-computer science field, not boring repetitive tasks like reporting.

In reality, ML can really shine when used to automate categorisation work currently being done manually.

I once knew a company that hired 21 engineers (3 teams of 6, plus their 3 team leaders) to essentially do three things repeatedly:

  1. Push wells to produce as much natural gas as possible.
    • If the well was already producing, i.e. was ‘up and running’, the options for the engineer were ‘speed up the pump’, ‘slow down the pump’ or ‘do nothing’.
  2. Diagnose why an off-line well had failed, so that it could be repaired properly.
    • While there were ~60 various ways the wells could break, 3 of the failure mechanisms comprised ~85% of all failures!
  3. Fill out paperwork related to issue #2 so that the well could be repaired.
    • The paperwork was very repetitive.

In total, this activity was worth about $25M/year in revenue and governed $240M/year in costs. So employing people to do the task was a no-brainer.

It may already be obvious, but this line of work was ripe for automation using ML. All of the production optimisation, diagnostics and paperwork could be automated by a single Python engineer, overseen by a single legacy engineer. Indeed, with the ML model watched carefully, the results would be faster, more accurate and more reliable – and the model would never take a vacation day!
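
As a toy illustration of automating the paperwork (task #3), here is a hypothetical sketch that fills a repair-request template from a diagnosed failure mode. The failure-mode names, recommended actions and template fields are all invented for illustration – they are not from any real system.

    from string import Template

    # Invented repair-request template -- real paperwork would differ.
    REPAIR_TEMPLATE = Template(
        "Well: $well_id\n"
        "Diagnosed failure: $failure_mode\n"
        "Recommended action: $action\n"
    )

    # Hypothetical stand-ins for the three dominant failure modes
    # (the ~85% of cases) and their standard fixes.
    ACTIONS = {
        "rod_part": "Pull rods and replace the parted section.",
        "pump_wear": "Replace the downhole pump.",
        "tubing_leak": "Pressure-test and replace the tubing.",
    }

    def repair_request(well_id: str, failure_mode: str) -> str:
        # Rare failure modes (the remaining ~15%) go back to a human.
        action = ACTIONS.get(failure_mode, "Escalate for manual diagnosis.")
        return REPAIR_TEMPLATE.substitute(
            well_id=well_id, failure_mode=failure_mode, action=action
        )

    print(repair_request("WELL-0042", "pump_wear"))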

Note: You might think the other 16 engineers (and 3 team leaders) would be made redundant – but they may not be! They could be re-allocated to other valuable work elsewhere in the organisation. It just depends on the opportunities available.

So yeah, those are the headline issues I’ve seen when decision makers are considering ML in their workflows. Just remember – ML, just like other techniques at work, is not magic! It is only a tool to help here and there. At times it can help a lot! But it does require some basic technical and organisational support.

Having Good Information Available Only Matters if You (Can) Use It

In business it is easy to mistakenly think that having insights means they will be leveraged. I mean, if a team delivers analytical insights, the insights would be used, right?

In reality, insights only matter if (1) they are heard/understood, and (2) they are incorporated into decision making.

Regarding the first challenge – hearing/understanding insights – I’ve seen good work produced within an organisation go unheard. The reasons vary:

  1. The audience is too busy with life-or-death activity to care about insights.
    • Organisations with high turnover, or that struggle to balance workloads, are at risk of this issue.

  2. No one presents the insights to an appropriate audience.
    • Presenting the results and educating the organisation should be seen as a critical part of delivering any analytical result.

  3. The insights are not clear enough for the audience.
    • This often happens because the presenter isn’t clear on the insights themselves.

  4. The audience doesn’t want to listen to the presenters due to information overload.
    • This can happen when organisations manage meetings poorly, leaving front-line employees overloaded with information.

  5. The audience doesn’t want to listen to the presenters due to organisational challenges between teams or departments.

Similarly, there are many reasons why insights are well known, but are not incorporated into decision making. But that’s for another blog post.

Social Networks and the Radicalisation of Some – Exhibit A

This being a technology blog, some might balk at sharing political news. Unfortunately, these days, the combination of analytics, machine learning and social networks has produced platforms that routinely radicalise some members of Western society. And that radicalisation sometimes gets expressed violently, which is inherently political.

Here’s a link to a Talking Points Memo reader email from El Paso, Texas; the writer is a little confused (and more than a little sad) that a white supremacist drove the 9 hours from the Dallas/Ft. Worth area to El Paso just to shoot Hispanic folks.

The key bit:

It feels more like terrorism and less like a madman, or a troubled individual. When someone shoots up their own community or high school, it’s personal, it’s in some ways about vengeance and self hatred and lashing out against the environment that made them.
This feels different. This is purely white terrorism.

– Anonymous Reader from El Paso, Texas

The shooting suspect reportedly got radicalised through activity on the social network 8chan. 8chan became popular after 4chan’s rules proved too restrictive for some.

On Social Media: Is Content Sourcing the Issue?

Now imagine you run a social media company and you’re faced with all of the issues associated with modern social media companies. What do you do about it?

Most big-name social media companies work, generally, the same way:

  1. Provide a platform that:
    • Makes it easy to generate content
    • Allows users to connect to others
    • Provides a feed for each user to consume
  2. Harvest the content generated across the platform to create a feed specific to each individual user. Optimise the user’s feed to maximise ad revenue.
  3. Repeat over and over.

There are a few things to notice about this process:

  1. The social media company can improve how users generate content. Content generation can become easier, or more varied.

  2. The process lends itself to experimentation. The platform can trial a feature with a subset of users, then roll it out to others later.

  3. If a user’s feed is optimised for maximum ad revenue, either engagement time or content can be adjusted – assuming some content is more ad-revenue-enhancing than other content.

  4. The source of the content does not appear to matter. The content may or may not be factual, reliable, or from the user’s network of connections.

Notice that the first two observations are rather benign. Experiments are fine. Easier and more varied content generation is fine too.

But the other two observations pose quite a social challenge. In fact, once they are said aloud, it is easy to think of examples of each being a well-known issue. We’re all familiar with Fake News. And couldn’t less-commercial content be important, considering that increased use of social media is correlated with poorer mental health?

Zeroing in on Content Sourcing

If a social media platform were to begin vetting sources, how might it go about it? One idea: limit profiles to actual people posting normal things (i.e. eliminate fake profiles and accounts that are part of influence campaigns). It could be implemented with a series of policies:

  1. Hibernate unused accounts (so they are effectively removed from the social graph).
  2. Hibernate accounts with bot-like activity (such as suspicious post rates/frequencies, geographical metadata, etc.).
  3. Hibernate accounts that are frequently flagged for objectionable content. (I recognise this one is trickier.)
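
As a toy sketch of what policies 1 and 2 might look like in code – the thresholds and account fields are invented for illustration, not drawn from any real platform:

    from dataclasses import dataclass

    @dataclass
    class Account:
        days_since_last_login: int
        posts_per_day: float
        distinct_posting_countries: int

    def should_hibernate(account: Account) -> bool:
        # Policy 1: unused accounts drop out of the social graph.
        if account.days_since_last_login > 365:
            return True
        # Policy 2: implausible post rates or geographic spread
        # suggest bot-like activity.
        if account.posts_per_day > 200 or account.distinct_posting_countries > 5:
            return True
        return False

    print(should_hibernate(Account(3, 450.0, 8)))  # True: bot-like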

As I see it, these kinds of policies risk negatively impacting ad revenue or stock value.

I’m curious: is there public data showing the impact of such experiments? In other words, has anyone measured what happens to ad revenue when suspicious activity is limited?

What If the Company’s Value Wasn’t Based on Advertising or ‘Number of Active Users’?

Now imagine, again, that you run a social media company facing all of the issues above. What else could you do?

You might wonder whether it’s possible to reshape the organisation to make money from subscriptions, financial transactions or e-commerce instead. If so, you reason, the risks associated with the legacy (ad-based) business would become irrelevant.

Are there any social media companies actively exploring other business models? The only one I know of is the big-blue-one-that-shall-not-be-named.

I wonder if they’ve done the research on user-generated content.

Oh Python, How I Love You (Until I Get A Data Type-Related Error)

One of the worst experiences for an intermediate programmer is to debug a sophisticated bit of software only to discover that, somewhere in the mix, Python has been passing around the wrong data type.

Sigh.
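
A minimal example of the kind of failure I mean – everything runs fine until a caller hands over a generator instead of a list:

    def mean(values):
        return sum(values) / len(values)

    print(mean([1, 2, 3]))             # 2.0 -- fine
    print(mean(x for x in [1, 2, 3]))  # TypeError: 'generator' has no len()

Type hints plus a static checker like mypy would flag the second call before it ever ran.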

I’ve never heard of data-science types using statically typed languages (TypeScript, Java, C++) for their analyses, but it wouldn’t be the dumbest thing I’ve ever heard.

Edit 13 Aug 2019:

I’ve since had a think about my last comments regarding statically typed languages and data science – and I don’t know what I was thinking. Heaps of DS work is done in statically typed languages! It is perfectly normal to do analyses in C++, Java (even C!), C#, etc. I think when I wrote that I was thinking, for whatever reason, of TypeScript, which is quite young. Hell, I’ve even done work in C#!

The Importance of an Executive Mandate for Effective Analytics

Yesterday I talked with a hiring manager about leading an analytics team at a very large organisation. The role was dual-hatted: project manager and people manager. I found the project management component odd, so I asked. She explained the projects were almost always implementations of websites/pages to display the team’s results. Projects were typically a few million dollars each.

With that in mind, now imagine your team delivering analytical results – to an organisation of thousands of people – without enough organisational support to actually publish them. You’d likely spend a lot of time publishing, presenting, re-presenting, refreshing and marketing the results, leaving less time for the next analysis.

I was pretty impressed they supported the analytics team in this way! They weren’t the most tech-savvy folks, but they found a solution.

I suspect the organisation had either:

  • Learned this lesson the hard way (i.e. analytics without a strong executive mandate doesn’t have much impact), or
  • Had a legislative (or possibly political) requirement to deliver.

Either way, it was great to see a large organisation support analytics in a way as concrete as multiple millions of dollars in what is effectively a publishing budget.