24 May

Data Science Overview: The Bigger Picture

For the last five years or so, “data science” has been a hot term that has steadily gained traction in both the IT world and the media. Riding on buzzwords like “big data” and “machine learning”, data science rose to prominence as the technology that could solve every problem imaginable… or at least it’s hailed as one. This praise, although not completely unwarranted, seems too good to be true, so how does data science function in the real world?

Much like cryptocurrencies, it’s an incredibly broad topic with lots of implications, limitations, and intricacies, so summarizing it in a few sentences is impossible. Instead, we continue our “Fundamentals” series (the previous installment covers hiring a freelance web developer) and dedicate a whole article to exploring data science more thoroughly: What is data science? How can it be used? What technologies are involved? Which problems and caveats should be avoided? Let’s find out.

Getting the definition right

[Image: Too bad not all dictionaries have caught up]

First of all, we should stress that data science is a multidisciplinary field that incorporates a number of technologies, professions, and approaches. The Journal of Data Science, a publication that has been at the forefront of this field, defines data science as almost everything that has something to do with data: collecting, analyzing, modeling… Yet the most important part is its applications, and there are all sorts of them.

In a sense, it can be compared to the term “medicine”: a phrase like “I’m a data scientist” is as ambiguous as “I’m a doctor”. It does point in the general direction of what this person does, but doesn’t really explain anything beyond that. A good way to understand a profession is to ask: “What problems do the people of this profession solve?” Well, data scientists use data to create an impact for a given business, research project, and so on.

The rise of the data-driven approach is closely linked to the emergence of Web 2.0. In essence, this new rendition of the internet emphasized interactivity and user-generated content, allowing users to leave so-called “digital footprints”: videos they liked, pictures they shared, search queries they sent, and much more. This information came to be known as “big data”.

Individually, these data points aren’t insightful at all (“John Doe from Michigan City, Indiana, liked a video about the Linux file system”, so what?), but collectively, they reveal hidden correlations and trends. In essence, data science provides a bird’s-eye view of the problem.

Creating impact

[Image: In other words, are we even going in the right direction?]

The very essence of any data science job is solving problems: the company provides you with (often ambiguous) data sets and expects you to make something out of them, something that can help solve a real business problem. When analyzed correctly, data can provide a lot of valuable insight into how the business operates at the moment and which aspects could be improved. Typical ways of creating impact include:

  • Identifying key business metrics that should be tracked
  • Predicting the performance of these metrics
  • Predicting the behavior of customers
  • Testing product changes via experiments
  • Improving the product via the creation of data products

Success metrics and tracking metrics

Proficiency in data science/analysis doesn’t end with formal technical competencies, though: success metrics and tracking metrics also need to be defined and analyzed. The difficulty with this skill set lies in its breadth: understanding the key metrics of a product or service is something that product managers typically do, which means that data scientists also need to understand the product they’re working on well.

For a typical social media platform, success metrics would include:

  • Active users
  • Quality standards
  • User satisfaction

Tracking metrics, on the other hand, are:

  • Time spent (per user)
  • Content consumed (e.g. how many videos/pictures watched)
  • Content interacted with (e.g. likes, comments, shares)
  • Content created (e.g. how many videos/pictures uploaded)
  • Forced quits (because every instance of negative user experience is equally important)
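As a rough illustration of how tracking metrics like the ones above might be derived from raw data, here is a minimal sketch in Python. The event log, its field layout, and the event names (`session`, `view`, `like`, `upload`, `forced_quit`) are all hypothetical; a real platform would define its own schema.

```python
from collections import defaultdict

# Hypothetical event log: (user_id, event_type, value).
# For "session" events the value is seconds spent; for all others it is a count.
EVENTS = [
    ("alice", "session", 300),
    ("alice", "view", 5),
    ("alice", "like", 2),
    ("bob", "session", 120),
    ("bob", "view", 3),
    ("bob", "upload", 1),
    ("bob", "forced_quit", 1),
    ("alice", "session", 180),
]


def tracking_metrics(events):
    """Sum the values of each event type separately for every user."""
    totals = defaultdict(lambda: defaultdict(int))
    for user, event_type, value in events:
        totals[user][event_type] += value
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {user: dict(per_type) for user, per_type in totals.items()}


def active_users(events):
    """Count distinct users who produced at least one event."""
    return len({user for user, _, _ in events})


metrics = tracking_metrics(EVENTS)
print(metrics["alice"]["session"])  # total seconds spent by alice
print(active_users(EVENTS))         # number of distinct users in the log
```

Keeping the aggregation per user and per event type makes it easy to roll the same numbers up into product-level success metrics later.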

Technologies and skills

[Image: Pride of any data scientist]

Still, data science is heavily tied to the technological tools that enable it: programming languages create the infrastructure, then tests are conducted, and finally the findings need to be visualized and communicated.

Python & R

These programming languages are the best picks for any data science work. It’s important to understand that they’re not competing for the “Best Language” title (in terms of raw numbers, Python is far more popular and developed); rather, they are two distinct tools, and proficient data scientists can use both of them to their fullest potential.

Python excels at data processing thanks to its object-oriented, general-purpose nature: Python’s community has created a lot of packages (e.g. pandas, NumPy, scikit-learn) that are well suited to data science and especially to machine learning.

R, although not as flexible as its counterpart, focuses on data analysis and really excels in this field — its packages allow for easy statistical computing, visualization (ggplot2), and standalone analysis.

Testing

Creating an impact is important, but it also needs to be tested sufficiently: just how well is this new feature doing? Has it improved our success metrics in any way? Does it have the potential to do it? When conducting tests, data scientists also need to gauge how many people they need in these testing groups, constantly balancing between “not enough participants for adequate analysis” and “so many participants that it’s starting to drain the budget”.
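To make the trade-off between too few and too many participants concrete, here is a minimal sketch of a common textbook approximation for the sample size of a two-proportion test. The function name, the default significance level (0.05), the default power (0.8), and the example conversion rates are assumptions made for illustration, not figures from any real experiment.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_baseline, p_expected, alpha=0.05, power=0.8):
    """Approximate users needed per group for a two-sided two-proportion test.

    Uses the normal approximation:
    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
    """
    if p_baseline == p_expected:
        raise ValueError("the two rates must differ to detect an effect")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_expected * (1 - p_expected)
    effect = p_expected - p_baseline
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical scenario: sign-up rate is 10% and we hope to lift it to 12%.
n_small_lift = sample_size_per_group(0.10, 0.12)
# A larger expected lift (10% to 15%) needs far fewer participants.
n_large_lift = sample_size_per_group(0.10, 0.15)
print(n_small_lift, n_large_lift)
```

The smaller the effect you want to detect, the more participants each group needs, which is exactly where the budget pressure comes from.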

Therefore, data scientists use various testing techniques to ensure that the product is headed in the right direction. With A/B testing (also called split testing), data scientists divide their users into two groups and expose Audience A to Feature A and Audience B to Feature B. For the test to be valid, the two audiences must be as similar as possible, while the difference between the two features must boil down to a single variable (e.g. putting the “Sign up now!” button either at the top or at the bottom).
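One common way to evaluate an A/B test on conversion rates is a pooled two-proportion z-test. The sketch below is only one possible approach, and the function name and the conversion counts are hypothetical.

```python
import math
from statistics import NormalDist


def ab_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical result: button at the top (A) vs. button at the bottom (B).
p_value = ab_test_p_value(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests the difference between the two variants is unlikely to be due to chance alone; the threshold for "small" should be fixed before the experiment starts.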

SQL

In most cases, companies build their data infrastructure around SQL. SQL, or Structured Query Language, is an indispensable tool for organizing and querying data: while remote web developers work their front-end magic via the likes of Angular, React, and Vue.js, SQL handles the dirty work of keeping all the precious data structured and accessible.

Therefore, SQL proficiency is crucial for any data science-related activity: retrieving and working with data is often done purely via SQL. Most importantly, SQL knowledge allows data scientists to interpret the structure, meaning, and relationships in source data, which can then be shaped for later analytical use. By the way, we’ve got some great SQL interview questions!
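To show what retrieving data and interpreting relationships via SQL can look like, here is a small self-contained sketch that uses Python’s built-in `sqlite3` module with an in-memory database. The `users` and `events` tables and their contents are made up for illustration.

```python
import sqlite3

# In-memory database with two hypothetical tables linked by user id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE events (user_id INTEGER REFERENCES users(id), kind TEXT);
    INSERT INTO users VALUES (1, 'US'), (2, 'US'), (3, 'DE');
    INSERT INTO events VALUES (1, 'like'), (1, 'share'), (2, 'like'), (3, 'like');
""")

# Join events to users and count likes per country.
QUERY = """
    SELECT u.country, COUNT(*) AS likes
    FROM events AS e
    JOIN users AS u ON u.id = e.user_id
    WHERE e.kind = 'like'
    GROUP BY u.country
    ORDER BY likes DESC, u.country
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
conn.close()
```

The join is where the relationship between the two tables becomes explicit: each event row only makes sense once it is tied back to the user who produced it.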

Caveats and misconceptions

[Image credit: xkcd]

Data science as a field has received a lot of media attention (including privacy-related scandals like Cambridge Analytica), and there simply haven’t been enough specialists to clarify the misconceptions that arose along the way. Let’s clear some of these up:

Companies that don’t have a clear data science strategy

For some companies, “data science” is merely a hip trend to follow and nothing more. This leads to situations where only one data scientist is hired; this poor data scientist is then expected to handle all data-related assignments, effectively doing the work of several different specialists.

To quote data science evangelist Jonny Brooks-Bartlett: “Following on from doing anything to please the right people, those very same people with all of the clout often don’t understand what is meant by ‘data scientist’. This means that you’ll be the analytics expert as well as the go-to reporting guy and let’s not forget that you’ll be the database expert too.”

This can be acceptable in a startup environment where the responsibilities of team members aren’t settled yet; in larger or older companies, however, this approach might not work at all. As Ben Weber, a Medium author who regularly busts data science-related myths, puts it: the company needs a clear understanding of how data science can improve its product. This means that simply hiring a data scientist is not enough; instead, company managers need to establish a data-oriented pipeline and avoid the “one person, three jobs” mentality.

“Data science = programming” or “Data science is purely technical”

Although programming skills are absolutely vital for this field, a proficient software engineer is not automatically a proficient data scientist: for the latter, mathematics and statistics are arguably more important. For some data analysts, programming per se isn’t even part of the job: they query all data via SQL and then process and visualize it in Excel.

One skill gets cited often: communication. In a technical profession, this competence may seem wildly out of place. However, finding data is not enough: a data scientist also needs to communicate their findings, i.e. explain what they mean and what decision the management team should make to create a positive impact.

Another important skill is visualization: it accompanies communication, helping the management team understand the data more easily. Together, these two skills allow the data scientist to become a real storyteller, transforming rather abstract data into insightful and captivating stories. Public speaking and formal writing are valuable additions too, as there’s usually a lot of back-and-forth communication with product managers and engineers.

[Image: Data visualization rocks! (This is how data science search trends compare)]

Conclusion

Data science is a large field with a lot of fascinating topics to cover; after all, there are entire blogs and journals dedicated to this area. Hopefully, we’ve answered the questions we set out with in sufficient detail. 🙂 Although it features the word “science”, data science isn’t purely technical or math-focused: there’s a lot of room for creativity and novelty.

