A Short History Lesson That Data Scientists Should Know
Data science was born when statistics and computer science met. Data has been important throughout history, but data science as a field evolved only recently. In 1962, the statistician John W. Tukey wrote in his paper “The Future of Data Analysis” that he had come to realize his main interest was data analysis, and he saw the significance of the rise of the stored-program electronic computer.
Later, in 1974, Peter Naur published “Concise Survey of Computer Methods”, in which he defined data science as “The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.”
The International Association for Statistical Computing was established in 1977 with the aim of converting data into information. In 1989 the first workshop on knowledge discovery in databases took place. The term “data science” first appeared in the title of a conference in 1996, in Japan.
Let’s fast-forward a bit and finish this short history lesson with a fun fact: in September 2012, the Harvard Business Review called data science the sexiest job of the 21st century.
Data Science vs Big Data
Data science is the field that deals with data analysis, data cleansing, and data preparation. Big data refers to extremely large volumes of data that can be analyzed for insights, helping businesses improve their strategies and make better decisions.
Data science is applied in digital advertising, recommender systems, and internet search, whereas big data is mostly used in finance, retail, and communication.
It is safe to say that data science evolved from statistics, mathematics, and programming. Data science is the ability to think outside the box and capture data in innovative ways.
Big data is usually impossible to store on a single computer because of the sheer volume of raw data involved. With so much information, big data can be seen as a nearly endless source of insight for businesses. It is commonly used in customer, fraud, and compliance analytics.
Skills You Need to Be a Data Scientist
If you’re wondering how to become a data scientist, take a look at the skills you need to have:
Programming skills: Python, R, or SAS for statistical analysis, plus SQL, Java, Perl, or C++ (not necessarily all of them)
Math: linear algebra, calculus and probability
Statistics: parameter estimation, maximum likelihood estimation, hypothesis testing, Bayesian analysis, linear regression, non-linear regression, categorical data analysis, etc.
Machine learning: random forests, k-nearest neighbors, and similar algorithms
Big data platforms: Hadoop and the ability to work with unstructured data
Data mining and data cleaning: extracting patterns from raw data and preparing it for analysis
Software engineering skills: to develop data-driven products
Data intuition: problem solving and the ability to recognize which things are important and which are not
Industry knowledge: depending on the industry where you want to work, you need to learn how data is collected and analyzed in that specific field.
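To make the statistics skills above a little more concrete, here is a minimal sketch of parameter estimation for simple linear regression. Under the usual Gaussian-noise assumption, the ordinary least squares fit coincides with the maximum likelihood estimate. The data points are made up for illustration.

```python
# Minimal sketch: fitting y = a + b*x by ordinary least squares,
# which for Gaussian noise is also the maximum likelihood estimate.
# The data points below are invented for illustration.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form OLS estimates for slope and intercept.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))  # 0.09 1.99
```

In practice you would reach for a library routine (for instance in R, SAS, or a Python statistics package) rather than the closed form, but the underlying estimation is the same.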
What Tools Should Data Scientists Know?
Now that you have learned about data scientist skills, let’s see which tools you should know. Almost every skill mentioned above goes hand in hand with tools: to apply programming skills you need to learn specific programming tools, and data visualization likewise requires tools of its own.
Depending on your role as a data scientist, the tools might differ. Some data scientists focus on data analysis, others on research. There are decision scientists, business intelligence analysts, risk and fraud engineers, big data software engineers, and machine learning engineers, and different roles require different data science tools.
For data analysis, it is useful to know R (the R project for statistical computing), Pandas (a Python data analysis library), or Julia (an alternative to R).
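A typical data-analysis step is grouping records and computing a summary statistic; Pandas does this in one line with groupby/mean. As a self-contained sketch, here is the same idea using only the Python standard library, with made-up sales records:

```python
# Group records by a key and compute the mean per group -- the kind of
# one-liner Pandas' groupby/mean handles, shown here with the standard
# library only. The records are made-up example data.
from collections import defaultdict
from statistics import mean

records = [
    {"region": "north", "sales": 120.0},
    {"region": "south", "sales": 90.0},
    {"region": "north", "sales": 100.0},
    {"region": "south", "sales": 110.0},
]

by_region = defaultdict(list)
for row in records:
    by_region[row["region"]].append(row["sales"])

avg_sales = {region: mean(values) for region, values in by_region.items()}
print(avg_sales)  # {'north': 110.0, 'south': 100.0}
```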
MySQL, CSV files, and Hive/Shark/Redshift are tools for data warehousing.
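The core SQL you need for warehousing work (SELECT, GROUP BY, aggregates) is largely the same across MySQL, Hive, and Redshift. As a runnable sketch, here is an aggregation query using Python's built-in sqlite3 module; the table and values are invented for illustration:

```python
# A small SQL sketch using Python's built-in sqlite3 module. The basic
# SELECT / GROUP BY syntax carries over to MySQL and warehouse engines
# such as Redshift. Table name and rows are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 20.0), ("alice", 50.0)],
)

# Total spend per customer, sorted by name.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 80.0), ('bob', 20.0)]
```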
D3.js is a data visualization tool used to put your work on the web. Matplotlib is a plotting library for Python, while ggplot2 serves the same role in R. Excel and Tableau can also be good for visualizing data.
For machine learning, data scientists use tools such as R, Python (scikit-learn), Spark (MLlib), MATLAB, KNIME, Weka, and RapidMiner.
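To show what these libraries do under the hood, here is a minimal from-scratch sketch of a k-nearest-neighbors classifier, one of the algorithms mentioned above. In practice you would use scikit-learn's KNeighborsClassifier; the points and labels below are made up:

```python
# Minimal k-nearest-neighbors classifier. In practice, reach for
# scikit-learn's KNeighborsClassifier; this sketch just shows the idea.
# Training points and labels are invented for illustration.
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label) pairs; query: an (x, y) point."""
    # Sort training points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda p: math.dist(p[0], query))
    # Majority vote among the k closest labels.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_predict(train, (1, 1)))    # a
print(knn_predict(train, (5, 5.5)))  # b
```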
The Hadoop platform and SQL are tools for data storage, whereas Bash and Python are used for data cleaning. Python and Ruby are also popular for prototyping.
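Data cleaning in Python usually means normalizing messy values and dropping incomplete records, the kind of work also done with Bash one-liners or Pandas. Here is a tiny sketch on made-up input rows:

```python
# A tiny data-cleaning sketch: normalize whitespace and casing, then
# drop rows with missing values. The raw rows are made up to show
# typical messiness (stray spaces, inconsistent case, empty fields).
raw_rows = [
    {"name": "  Alice ", "age": "34"},
    {"name": "BOB", "age": ""},       # missing age: will be dropped
    {"name": "carol", "age": "29"},
]

clean_rows = []
for row in raw_rows:
    name = row["name"].strip().title()
    age = row["age"].strip()
    if not name or not age:
        continue  # skip incomplete records
    clean_rows.append({"name": name, "age": int(age)})

print(clean_rows)
# [{'name': 'Alice', 'age': 34}, {'name': 'Carol', 'age': 29}]
```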
SAS and R are statistics tools for data exploration, prototyping, hypothesis testing, and algorithm development.
Some of the tools that data scientists use for data engineering tasks are Java (for writing production code), Cassandra, Spark, Splunk, Pig, Hive, Scala, Apache Hadoop, etc.
Given that technology changes and evolves very quickly, by the time you find this article, some new alternative tools might show up. Do share your suggestions in the comments section to help other readers stay up-to-date.
The data science field is big and will only get bigger. New skills will arise, and if you want to be a good data scientist you need to follow the trends and never stop learning new tools. Hopefully this list of the most common tools employers expect data scientists to know is a good start. Depending on the type of job, the company, and the employer, your set of skills and tools may differ. In the end, you can always ask your employer what they expect you to know, and when applying for jobs, make sure you read carefully which data science tools are required.