- Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
- It is a concept to unify statistics, data analysis, machine learning and their related methods in order to “understand and analyze actual phenomena” with data.
- It employs techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, and information science.
Prerequisites for Data Science with Python
- Knowledge of Statistics (Descriptive and Inferential Statistics)
- Knowledge of Mathematics (Linear Algebra and Calculus)
- Programming Knowledge (Python, SQL)
Structured vs. Unstructured Data
- Structured
  - Structured data is highly organized and easily understood
  - Has a predefined data model
  - Usually stored in relational databases (RDBMSs)
  - Usually text only
  - Examples: dates, phone numbers, credit card numbers, names, and addresses
- Unstructured
  - Has no predefined data model
  - Usually stored in NoSQL databases
  - May be text, images, audio, video, and other formats
  - Human-generated examples: blog comments, social media data, photo-sharing sites, chat messages, and survey responses
  - Machine-generated examples: sensor data, scientific data, satellite data, and surveillance data
- Semi-Structured
  - Has no fixed schema, but tags or keys describe the structure of each record
  - Examples: XML and JSON
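To make the distinction concrete, here is a minimal sketch that turns semi-structured JSON records into structured rows with a fixed set of columns. The records, the column names, and the `to_row` helper are all made up for illustration; only Python's standard `json` module is used.

```python
import json

# Semi-structured input: every record describes itself with keys, but
# fields may be missing or nested. These records are hypothetical.
raw = """
[
  {"name": "Asha", "phone": "555-0101", "address": {"city": "Pune"}},
  {"name": "Ben", "address": {"city": "Leeds", "zip": "LS1"}},
  {"name": "Chen", "phone": "555-0199"}
]
"""

# The predefined "data model" that a structured store would expect.
COLUMNS = ["name", "phone", "city"]


def to_row(record):
    """Flatten one JSON record into a fixed-width row; gaps become None."""
    address = record.get("address", {})
    return (record.get("name"), record.get("phone"), address.get("city"))


rows = [to_row(record) for record in json.loads(raw)]

for row in rows:
    print(dict(zip(COLUMNS, row)))
```

Every row now has the same three columns, which is the shape a relational table expects. Fields outside the model (such as `zip`) are dropped.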
Data Science process
- Data Collection
- Data Pre-processing
  - Data Cleaning (detecting and correcting or removing missing, duplicate, or invalid values)
  - Data Transformation (converting data from one format or structure into another)
- Data Processing
- Exploratory Data Analysis
- Data Visualization
- Model the data
- Data Interpretation and evaluation
- Deployment
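The steps above can be sketched end to end on a toy problem. This is only an illustration under stated assumptions: the CSV data is invented, the model is a hand-written least-squares line, and "deployment" is reduced to a `predict` function. Visualization is left out so the sketch runs with the standard library alone. A real project would more likely use Pandas and Scikit-learn.

```python
import csv
import io
import statistics

# 1. Data collection: a hypothetical CSV of advertising spend vs. sales.
RAW_CSV = """spend,sales
10,25
20,45
,60
30,65
40,85
abc,90
"""


# 2. Pre-processing: skip rows with missing or non-numeric values (cleaning)
#    and convert the remaining text fields to floats (transformation).
def clean(rows):
    cleaned = []
    for row in rows:
        try:
            cleaned.append((float(row["spend"]), float(row["sales"])))
        except ValueError:
            continue  # rows such as ",60" or "abc,90" are dropped
    return cleaned


data = clean(csv.DictReader(io.StringIO(RAW_CSV)))
xs = [x for x, _ in data]
ys = [y for _, y in data]

# 3. Exploratory analysis: basic summary statistics.
mean_x = statistics.mean(xs)
mean_y = statistics.mean(ys)

# 4. Modelling: fit sales = slope * spend + intercept by least squares.
covariance = sum((x - mean_x) * (y - mean_y) for x, y in data)
variance = sum((x - mean_x) ** 2 for x in xs)
slope = covariance / variance
intercept = mean_y - slope * mean_x

# 5. Evaluation: mean absolute error on the data the line was fitted to.
predictions = [slope * x + intercept for x in xs]
mae = statistics.mean(abs(p - y) for p, y in zip(predictions, ys))


# 6. Deployment (stand-in): expose the fitted model as a function.
def predict(spend):
    return slope * spend + intercept


print(f"rows kept: {len(data)}, slope: {slope}, intercept: {intercept}")
print(f"predicted sales for spend=50: {predict(50)}")
```

Evaluating on the training data, as done here, overstates how well a model generalizes. In practice you would hold out a separate test set.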
What can we achieve with Data Science?
Data science has applications in Business, Agriculture, Weather Forecasting, E-Commerce, Manufacturing, Banking, Healthcare, Transport, Finance, Movies, R&D, Retail, and more.
Below are some of the applications of data science:
- Recommending products to customers, based on their purchase history, product browsing history, and various other parameters.
- Analysis of customer reviews to produce better products.
- Sentiment analysis of customer comments, tweets, etc. to classify them as positive or negative
- Prediction of floods and natural disasters
- Fraud detection in Banking
- Biomedical image analysis
- Automated irrigation system
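As an illustration of the first application, product recommendation from purchase history, here is a deliberately simple sketch. It scores products by how much the customers who bought them overlap with the target customer. The customers, products, and the `recommend` function are hypothetical, and production recommenders use far richer signals and models.

```python
from collections import Counter

# Hypothetical purchase histories: customer -> set of products bought.
purchases = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse"},
    "carol": {"laptop", "monitor"},
    "dave": {"mouse", "keyboard"},
}


def recommend(customer, purchases, top_n=2):
    """Suggest products bought by customers with overlapping histories."""
    own = purchases[customer]
    scores = Counter()
    for other, basket in purchases.items():
        if other == customer:
            continue
        overlap = len(own & basket)  # shared purchases act as similarity
        for product in basket - own:  # only products the customer lacks
            scores[product] += overlap
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [product for product, score in ranked[:top_n] if score > 0]


print(recommend("bob", purchases))
```

Here `bob` is offered a keyboard first, because the two customers most similar to him both bought one.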
Some important Data Science Tools and Libraries for Python
- NumPy
- Pandas
- Matplotlib
- Seaborn
- SciPy
- Scikit-learn
- TensorFlow
- Keras
- NLTK (Natural Language Toolkit)
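As a small taste of the first two libraries in this list, the sketch below uses NumPy for array arithmetic and Pandas for tabular work. It assumes both packages are installed (for example with `pip install numpy pandas`), and the height and weight figures are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements stored as NumPy arrays.
heights_cm = np.array([150.0, 160.0, 170.0, 180.0])
weights_kg = np.array([50.0, 58.0, 66.0, 74.0])

# NumPy: vectorized statistics without explicit loops.
mean_height = heights_cm.mean()
correlation = np.corrcoef(heights_cm, weights_kg)[0, 1]

# Pandas: labelled columns, derived columns, and row filtering.
df = pd.DataFrame({"height_cm": heights_cm, "weight_kg": weights_kg})
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2
tall = df[df["height_cm"] > mean_height]

print(df.describe())  # count, mean, std, min, quartiles, max per column
print(tall)
```

Matplotlib and Seaborn build on the same arrays and DataFrames for plotting, which is why NumPy and Pandas are usually learned first.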
Responsibilities of a Data Scientist
- Analyzing large amounts of raw information to find patterns that help improve the business
- Collecting raw data, preprocessing it, and analyzing it
- Building models to address business problems
- Presenting information using visualization tools
- Proposing solutions and strategies for business problems
Requirements to become a Data Scientist
- Knowledge of SQL and Python; familiarity with Scala and Java is good to have
- Knowledge of Python libraries such as NumPy, Pandas, Matplotlib, Seaborn, SciPy, Scikit-learn, and Keras
- Knowledge of visualization tools (e.g. Tableau, Seaborn, or Matplotlib) and data frameworks (e.g. Hadoop)
- Experience with distributed data computing tools such as Hadoop, Hive, Pig, and Apache Spark
- Analytical mind and business acumen
- Strong mathematics and statistics skills
- Critical thinking and problem-solving skills
- Excellent communication and presentation skills
- A degree in Computer Science, Data Science, Engineering, or a related field
Competitions
Once you have learned the required data science tools and libraries, you can start applying your skills by participating in Kaggle competitions.
- Kaggle is an online community of data scientists and machine learners, owned by Google LLC. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.
- You can join a competition to solve real-world machine learning problems.
- The Titanic ML competition is a good first challenge for diving into ML competitions and familiarizing yourself with how the Kaggle platform works
- For more information on Kaggle, visit https://www.kaggle.com/
Learn more about Artificial Intelligence and Machine Learning in our upcoming blog articles.
Happy Learning!