Friday, September 24, 2021

A Statistics Student's Take on Data Science


What is Data Science?

Data science is not about making complicated models, awesome visualizations, or writing code. Data science is about using data to create as much impact as possible for your company. That impact can take many forms: insights, data products, or product recommendations. To deliver those things you do need tools such as complicated models, data visualization, and code. But essentially, as a data scientist, your job is to solve real company problems using data. Which tools you use to do it? The company doesn't care!

There are many misconceptions about data science, especially on YouTube, and I think the reason is a huge misalignment between what's popular to talk about and what's needed in the industry.

Before data science, the popular term was data mining. The 1996 article "From Data Mining to Knowledge Discovery in Databases" describes knowledge discovery as the overall process of discovering useful knowledge from data, with data mining as the step in that process that extracts patterns. In 2001, William S. Cleveland wanted to take this to another level. He did that by combining statistics with computer science. Basically, he made statistics a lot more technical, which he believed would expand the possibilities of data mining and produce a powerful force for innovation. Statistics could now take advantage of computing power, and he called this combination data science.

It was also around this time that Web 2.0 emerged. Websites were no longer just digital pamphlets but a medium for shared experiences among millions and millions of users. These were websites like MySpace (2003), Facebook (2004) and YouTube (2005). People could now interact with these websites: they could contribute, post, comment, like, upload and share, leaving their footprints in the digital landscape we call the internet.

Eventually, so much data was created that it became too much to handle using traditional technologies. We call this big data. That in turn opened a world of possibilities for finding insights using data. But it also meant that even the simplest questions required sophisticated data infrastructure just to handle the data. We needed parallel computing technology like Hadoop, MapReduce and Spark. The rise of big data in 2010 sparked the rise of data science to support the needs of businesses that wanted to draw insights from their massive unstructured datasets.

The Journal of Data Science describes data science as almost everything that has something to do with data: collecting, analyzing, modelling. Yet the most important part is its applications, all sorts of applications, for example machine learning. In 2010, with the new abundance of data, it became possible to train machines with a data-driven approach rather than a knowledge-driven approach. All the theoretical papers about recurrent neural networks and support vector machines (SVMs) became feasible, something that could change the way we live and how we experience things in the world. Deep learning was no longer an academic concept confined to thesis papers. It became a tangible, useful class of machine learning that would affect our everyday lives.

Machine learning and artificial intelligence dominated the media, overshadowing every other aspect of data science, such as exploratory analysis, experimentation and the skills we traditionally called business intelligence. Now the general public thinks of data scientists as researchers focused on machine learning and AI, but the industry is hiring data scientists as analysts, so there is a misalignment. The reason is that, yes, most of these data scientists can work on more technical problems, but big companies like Google, Facebook and Amazon have so much low-hanging fruit for improving their products that they don't need advanced machine learning or statistical knowledge to find that impact in their analysis.

Being a good data scientist isn't about how advanced your models are. It is about how much impact you can have with your work. You are not a data cruncher, you are a problem solver. You are a strategist. Companies will give you their most ambiguous and difficult data problems and expect you to guide them in the right direction.


2 Comments:

At September 25, 2021 at 3:31 AM , Blogger Unknown said...

This is insightful

 
At September 25, 2021 at 8:31 AM , Blogger Anselm lums said...

Wow iko fine

 
