These AI Class 10 Notes for Chapter 4, Data Science, simplify complex AI concepts for easy understanding.
Data Science Class 10 Notes
Data science is a multidisciplinary field that utilizes scientific methods, algorithms, processes, and systems to extract insights and knowledge from structured and unstructured data.
It encompasses various techniques from statistics, mathematics, computer science, and domain expertise to analyze and interpret complex data sets.
Note The rock, paper, scissors game can be built using ideas from data science. In this game, the user is asked to make a choice; based on the choices of the user and the computer, the result is displayed along with both choices.
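Here is a minimal sketch of how such a game might be written in Python; the prompts and messages are illustrative, not from the notes:

```python
# A minimal rock-paper-scissors sketch: the user picks, the computer
# picks randomly, and the result is shown with both choices.
# (Input validation is omitted for brevity.)
import random

choices = ["rock", "paper", "scissors"]
user = input("Enter rock, paper or scissors: ").strip().lower()
computer = random.choice(choices)

print(f"You chose {user}, the computer chose {computer}.")
if user == computer:
    print("It's a tie!")
elif (user, computer) in [("rock", "scissors"), ("paper", "rock"), ("scissors", "paper")]:
    print("You win!")
else:
    print("The computer wins!")
```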
Applications of Data Science Class 10 Notes
Some of the major fields in which data science is extensively used are
Internet Search
Data science plays a critical role in modern internet search. Search engines deal with massive amounts of information, and data science algorithms sift through this data to understand which results are most relevant to your query. These algorithms consider factors like the content of the webpages, how others have interacted with those pages in the past, and even your own search history (with your permission). Popular search engines such as Google, Yahoo and Bing all use data science algorithms to return the most informative and helpful results in a fraction of a second.
Transport
Data science is revolutionizing transportation, making it safer, more efficient, and more convenient. One key application is predictive maintenance. By analyzing sensor data from vehicles and infrastructure, companies can anticipate problems before they occur. For instance, airlines can predict engine malfunctions and schedule maintenance, reducing flight delays.
Data science also optimizes routes. Delivery services like UPS use algorithms to factor in traffic patterns and weather to create the most efficient routes for drivers, saving time and fuel. Similarly, ride-hailing apps like Uber analyze data to predict demand and optimize driver placement, reducing wait time for passengers.
Traffic management is another area benefiting from data science. Cities use sensors and cameras to collect real-time traffic data. This data is then analyzed to predict congestion and adjust traffic light timings or deploy emergency services efficiently. Overall, data science is transforming transportation, making it a data-driven industry for the better.
Various logistics companies like DHL and FedEx also make use of data science. It helps these companies find the best route for the shipment of their products, the best time for delivery, the best mode of transport to reach the destination, and so on.
Recommendation Systems
Data science plays a crucial role in powering recommendation systems, enabling platforms like Netflix and Amazon to deliver personalized experiences to their users. These systems analyze vast amounts of data, including user interactions, preferences, and behaviors, to generate tailored recommendations.
For example, Netflix employs collaborative filtering algorithms that compare users’ viewing habits to identify patterns and suggest content similar to what they’ve enjoyed in the past. Additionally, they use content-based filtering, which examines the attributes of shows and movies to recommend items with similar characteristics.
Amazon, on the other hand, utilizes a combination of techniques such as item-to-item collaborative filtering and user behavior analysis to suggest products based on past purchases, searches, and browsing history. These recommendation systems use data science to enhance user satisfaction, increase engagement, and drive sales by delivering relevant content and products, ultimately improving the overall user experience.
Healthcare
Data science is used to analyze vast volumes of medical data and enhance patient care. Data is collected by doctors using wearables and sensors, and data scientists then use this information to track current symptoms or forecast future health problems. Early intervention and individualized treatment plans make better overall health outcomes possible.
In the healthcare industry, data science acts as a boon. It is used for:
- Detecting tumors
- Drug discovery
- Medical Image Analysis
- Virtual Medical Bots
- Genetics and Genomics
- Predictive Modeling for Diagnosis etc.
Airline Route Planning
Route planning in the airline business is being revolutionized by data science. Airlines are now able to anticipate flight delays more accurately due to data analysis on a massive scale. This assists them in making well-informed routing decisions, including selecting a route with a stopover that provides a buffer in case of an issue or going the more direct route with a lesser chance of delay.
For example, a flight from Delhi to the USA might take a direct route over the North Pole if data suggests clear skies and favorable winds. However, if weather patterns indicate a higher chance of delays on the direct path, the airline might choose a route with a stopover in Europe or the Middle East. This allows for potential delays at the stopover point without significantly impacting the arrival time at the final destination.
Image Recognition
Data science is also used in image recognition. For example, when we upload a picture with a friend on Facebook, Facebook suggests tagging the people in the picture. This is done with the help of machine learning and data science: the faces in the uploaded image are analyzed against the profiles of one's Facebook friends, and when a face matches someone's profile, Facebook suggests auto-tagging.
In Finance
Data science is a game-changer for financial institutions. It helps automate risk analysis, allowing for better strategic decisions. Additionally, data science tools can be used to predict customer lifetime value and even stock market movements. For instance, in the stock market, data scientists analyze historical data to identify patterns that might predict future trends. This doesn’t guarantee perfect foresight, but it provides valuable insights for informed investment decisions.
Data science is important for:
- Customer Analysis
- Real time Analysis
- Risk Analysis
- Fraud Detection
- Customer data management
- Algorithmic trading
Data Science in Games
Data science ideas are combined with machine learning in several games where players compete against a computer opponent in order to improve the capabilities of the computer. With the goal of giving the gamer a more dynamic and demanding gaming experience, the computer learns from past gameplay data and gradually modifies its techniques. This kind of play is frequently seen in games such as chess and EA Sports titles, where the computer opponent constantly modifies its strategy according to trends and results seen in past games.
Revisiting AI Project Cycle Class 10 Notes
Scenario
A large retail chain, “SuperMart,” operates across multiple locations. The company aims to enhance its sales forecasting accuracy to optimize inventory management and increase profitability. SuperMart has historical sales data, inventory records, and information on various promotional campaigns conducted throughout the year. However, they struggle with accurately predicting sales, leading to instances of stockouts or excess inventory.
1. Problem Scoping
Let us look at the various factors creating this problem. Using the 4Ws canvas, we will be able to identify and scope the problem.
Who Canvas: Who is having the problem?

| Question | Answer |
| --- | --- |
| Who are the stakeholders? | Retail analysts, store managers, supply chain managers, IT support staff |
| What do you know about them? | They set overall business strategies, oversee day-to-day operations at individual store locations, and manage the flow of goods from suppliers to customers |
What Canvas: What is the nature of their problem?

| Question | Answer |
| --- | --- |
| What is the problem? | Inaccurate sales forecasting, leading to suboptimal inventory management |
| How do you know it is a problem? | Indicators such as frequent stockouts, high inventory carrying costs, reactive inventory management practices, and customer complaints about product availability |
Where Canvas: Where does the problem arise?

| Question | Answer |
| --- | --- |
| What is the context/situation in which the stakeholders experience this problem? | Stores struggle to meet customer demand during peak periods due to inaccurate forecasts |
Why Canvas: Why do you think it is a problem worth solving?

| Question | Answer |
| --- | --- |
| What would be of key value to the stakeholders? | Accurate sales forecasts that facilitate efficient staffing and inventory stocking, reduce stockouts, and improve overall supply chain efficiency |
| How would it improve their situation? | Reduced stockouts and excess inventory; optimized inventory management practices; enhanced operational efficiency across the retail chain; increased competitiveness in the market through data-driven decision-making |
Based on the 4Ws canvas, the Problem Statement Template can be filled as follows:

| Template | Details | 4Ws |
| --- | --- | --- |
| Our | Retail analysts, store managers, supply chain managers, IT support staff | Who? |
| Have a problem of | Inaccurate sales forecasting, leading to suboptimal inventory management | What? |
| While | Sales predictions are inaccurate, leading to instances of stockouts or excess inventory | Where? |
| An ideal solution would be | To be able to predict the sales of different products accurately | Why? |
This Problem Statement Template has given clarity to the various factors around the problem, and the goal can now be stated as:
“To be able to improve sales predictions and optimize inventory management for better business performance”
2. Data Acquisition
We can now move to the second stage of the AI Project Cycle, i.e. Data Acquisition. To build an AI-based project, we need data for training and testing, and we must understand what kind of data needs to be gathered to work towards the goal. For this problem statement, a dataset covering all the relevant elements is assembled over a period of 30 days. The data can be collected from historical sales records, inventory records, promotional calendars, PoS transactions, weather data, and economic indicators.
3. Data Exploration
Once the dataset is ready, we look into the data that has been collected and understand what needs to be done and what exactly is required from it. In this case, since the goal of our project is to predict sales and minimize instances of stockouts and overstocking, the collected data is crucial for planning anything further.

We begin by carefully parsing the dataset and extracting the information relevant to our analysis. Next, we systematically address any errors or inconsistencies, correcting inaccuracies and filling in missing data points. Finally, we verify the accuracy of the cleaned dataset through validation checks before proceeding with further analysis or interpretation.

Thus, we extract the required information from the dataset and clean it so that no errors or missing elements remain in it.
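A minimal sketch of these cleaning steps with Pandas is shown below; the file name supermart_sales.csv and the units_sold column are hypothetical:

```python
# A sketch of parsing, cleaning and validating the dataset with Pandas.
import pandas as pd

df = pd.read_csv("supermart_sales.csv")   # parse the collected dataset (hypothetical file)
df = df.drop_duplicates()                 # remove duplicate records
df["units_sold"] = df["units_sold"].fillna(df["units_sold"].mean())  # fill missing values
df = df.dropna()                          # drop any rows that are still incomplete
print(df.describe())                      # validation check: summary statistics
```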
4. Modelling
Once the dataset is cleaned and prepared, we move on to training our regression model. Regression is a type of supervised learning where the model learns to predict continuous values based on input features. Given our dataset spanning 30 days of continuous data, regression is well-suited for forecasting future values.
Initially, the model is trained on the first 20 days’ worth of data, allowing it to capture temporal trends and patterns. Subsequently, the model’s performance is assessed by comparing its predictions to the actual values for the remaining 10 days. This iterative process of training and evaluation enables us to fine-tune the model and validate its effectiveness in predicting future values accurately.
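The split described above could look like the following sketch; since the SuperMart dataset is not given, it uses synthetic daily sales generated with NumPy:

```python
# A sketch of regression with a 20-day training / 10-day testing split.
import numpy as np
from sklearn.linear_model import LinearRegression

days = np.arange(1, 31).reshape(-1, 1)                      # day numbers 1..30 as the feature
sales = 100 + 5 * days.ravel() + np.random.randn(30) * 10   # synthetic daily sales

model = LinearRegression()
model.fit(days[:20], sales[:20])         # train on the first 20 days
predictions = model.predict(days[20:])   # predict sales for the remaining 10 days
print(predictions)
```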
5. Evaluation
Once the model has been trained on the training dataset of 20 days, it is now time to see if the model is working properly or not. The steps involved are as follows:
- Compare the performance of different models and configurations using validation metrics.
- Assess the model’s sensitivity to changes in parameters and input variables.
- Validate the model’s performance across different store locations, product categories, and time periods.
- Incorporate feedback from stakeholders and refine the model iteratively.
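For instance, comparing predictions against actual values with common validation metrics might look like this sketch; the numbers are made up for illustration:

```python
# A sketch of evaluating forecasts with MAE and RMSE.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = np.array([210, 215, 198, 230, 225, 240, 238, 250, 245, 260])     # hypothetical test-day sales
predicted = np.array([205, 220, 200, 228, 230, 235, 240, 248, 250, 255])  # hypothetical model output

mae = mean_absolute_error(actual, predicted)
rmse = np.sqrt(mean_squared_error(actual, predicted))
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```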
Data Collection Class 10 Notes
Data collection is a foundational aspect of data science, as it involves gathering raw data from various sources to extract insights, make predictions, and drive decision-making. In the context of data science, data collection is crucial for building accurate models, training algorithms, and solving real-world problems.
Sources of Data
Data sources are used in a variety of ways. On the basis of data requirements and how data is collected, sources of data are categorized in two ways:
| Offline data sources | Online data sources |
| --- | --- |
| Interviews | Reliable websites |
| Surveys | Open-sourced government portals |
| Observations | Online forums |
| Sensors | Online survey sites |
While accessing data from any of these sources, the following points should be kept in mind:
- Define the quality and reliability of the data, including accuracy, completeness, consistency, and timeliness.
- Clarify ownership rights and usage permissions for the data, especially when accessing third-party data sources.
- Implement measures to protect sensitive data during transmission, storage, and processing.
- Ensure compatibility and interoperability between different data sources, formats, and systems when integrating and analyzing data.
- Establish data-sharing agreements, data usage guidelines, and data access controls to facilitate collaborative research and innovation.
Types of Data
- Structured Data Data organized into a tabular format with predefined fields and fixed schema, commonly found in relational databases and spreadsheets.
- Unstructured Data Data that lacks a predefined structure or format, such as text documents, images, videos, and social media posts.
- Semi-Structured Data Data that has some structure but does not conform to the rigid structure of relational databases, often represented in formats like JSON (JavaScript Object Notation) and XML (eXtensible Markup Language).
Data can be collected in various formats depending on the nature of the information being captured, the source of the data, and the intended use.
Here are some common formats in which data is usually collected:
Tabular Data (CSV, Excel) Tabular data is structured into rows and columns, with each row representing a single observation or record, and each column representing a variable or attribute. This format is commonly used for collecting structured data such as survey responses, transaction records, and sensor readings.
JSON (JavaScript Object Notation) JSON is a lightweight data interchange format that is easy for humans to read and write and for machines to parse and generate. It is commonly used for representing semi-structured data such as nested objects and arrays. JSON data is organized into key-value pairs and supports hierarchical structures.
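As a small illustration, here is how a made-up JSON record can be parsed with Python's built-in json module:

```python
# Parsing a JSON string of key-value pairs into a Python dictionary.
import json

record = '{"name": "Asha", "age": 15, "subjects": ["AI", "Maths"]}'  # hypothetical record
data = json.loads(record)                  # parse the JSON text
print(data["name"], data["subjects"][0])   # access values by key -> Asha AI
```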
XML (eXtensible Markup Language) XML defines rules for encoding documents in a format that is both human-readable and machine-readable. It is designed to be self-descriptive and extensible, making it suitable for representing hierarchical and semi-structured data. XML is commonly used for data interchange, configuration files, and document storage.
Text Data (Plain Text, PDF) Text data consists of unstructured or semi-structured textual content, such as documents, articles, emails, and social media posts. Text data is commonly collected for natural language processing (NLP) tasks such as sentiment analysis, text classification, and named entity recognition.
Image Data (JPEG, PNG) Image data consists of pixel values representing visual information such as photographs, diagrams, charts, and graphics. Image data is commonly collected for computer vision tasks such as object detection, image classification, and image segmentation.
Audio Data (WAV, MP3) Audio data consists of digitized sound waves representing speech, music, or other auditory signals. Audio data is commonly collected for tasks such as speech recognition, speaker identification, and audio classification.
SQL (Structured Query Language) SQL is a programming language used for managing and manipulating relational databases. It provides a standardized way to interact with databases, allowing users to perform various operations such as querying data, modifying data, defining database structures, and managing database access permissions.
Data Access Class 10 Notes
After collecting the data, one should know how to access it in a Python code. There are different Python packages which help us in accessing structured data inside the code. Some of these packages include NumPy, Pandas and Matplotlib.
NumPy
NumPy, which stands for Numerical Python, is a fundamental package for numerical computing in Python. It provides support for multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently.
NumPy offers a wide range of arithmetic operations on numbers, giving us an easier way of working with them. NumPy also works with arrays, which are homogeneous collections of data.
NumPy Array
An array is a fundamental data structure used in programming to store a collection of elements of the same data type. It provides a contiguous block of memory to hold the elements, allowing for efficient access and manipulation of data. Arrays are commonly used for organizing and processing data in various algorithms and applications.
Note NumPy offers an array object called ndarray.
Using NumPy, one can perform the following operations:
- Array creation NumPy offers functions to create arrays from lists, ranges, and other data structures.
- Mathematical operations It provides a wide range of mathematical functions for array manipulation, linear algebra, Fourier analysis, and more.
- Indexing and slicing NumPy arrays support advanced indexing and slicing operations for data extraction and manipulation.
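A short sketch of these three operations, using a small made-up array:

```python
# Array creation, element-wise maths, and indexing/slicing with NumPy.
import numpy as np

arr = np.array([10, 20, 30, 40, 50])   # array creation from a Python list
print(arr * 2)                          # mathematical operation applied element-wise
print(np.mean(arr))                     # a built-in mathematical function
print(arr[1:4])                         # slicing: elements at indexes 1, 2 and 3
```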
Difference between NumPy Array and List

| NumPy Array | List |
| --- | --- |
| Stores homogeneous data (all elements of the same type) | Can store heterogeneous data (elements of different types) |
| Faster and more memory-efficient for numerical operations | Slower and less memory-efficient for numerical work |
| Supports element-wise mathematical operations directly | Requires loops for element-wise operations |
| Requires NumPy to be installed and imported | Built into Python |
Pandas
Pandas is a powerful and versatile Python library designed for data manipulation and analysis. It provides high-performance, easy-to-use data structures, primarily DataFrame and Series, which enable efficient handling of structured data.
Pandas excels at tasks such as data cleaning, transformation, exploration, and analysis, making it well-suited for a wide range of data-related tasks in fields such as data science, finance, economics, and research.
Pandas allows users to perform tasks like indexing, slicing, grouping, aggregating, merging, and reshaping data with ease. Whether working with small or large datasets, Pandas offers intuitive and flexible tools for extracting insights, visualizing data, and making data-driven decisions.
Pandas provides two data structures for processing data:
(i) Series A one-dimensional array with homogeneous data; all the elements of a Series should be of the same data type.
(ii) DataFrame A two-dimensional array with heterogeneous data, usually represented in tabular format. The data is arranged in rows and columns: each column represents an attribute, and each row represents a record or observation.
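A quick sketch of both structures, with made-up values:

```python
# Creating a one-dimensional Series and a two-dimensional DataFrame.
import pandas as pd

s = pd.Series([10, 20, 30])                    # homogeneous, one-dimensional
df = pd.DataFrame({"name": ["Asha", "Ravi"],   # heterogeneous, tabular
                   "marks": [92, 85]})
print(s)
print(df)
```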
Here are a few things that Pandas does well:
- It provides tools for computing statistics, aggregating data, and extracting insights.
- Pandas enables users to explore datasets quickly through indexing, slicing, and visualization.
- With built-in functions, Pandas simplifies tasks like handling missing data and removing duplicates.
- It seamlessly integrates with other Python libraries for data analysis, visualization, and machine learning.
- Pandas efficiently handles large datasets, supporting operations like chunking and out-of-core processing.
- Pandas offers specialized tools for analyzing time series data, including resampling and rolling window calculations.
- Users can customize and extend Pandas’ functionality to meet specific data analysis needs.
- Pandas supports reading and writing data from various file formats, simplifying data import and export tasks.
- With a large and active community, Pandas provides extensive documentation, tutorials, and resources for users of all skill levels.
Benefits of Pandas
Some important features of Python Pandas are as follows
Handling of Data The Pandas library provides a fast and efficient way to manage and explore data. It provides two structures, Series and DataFrame, which help us not only represent data efficiently but also manipulate it in various ways.
Input and Output Tools Pandas provides a wide array of simple built-in tools for reading and writing data.
Visualise Visualising data is an important part of data science; it makes the results of a study understandable to the human eye. Pandas has a built-in ability to help you plot your data and see the various kinds of graphs formed.
Grouping With the help of this feature of Pandas, you can split data into categories of your choice according to the criteria you set. The GroupBy function splits the data, applies a function, and then combines the result.
Merging and joining of datasets While analysing data, we constantly need to merge and join multiple datasets to create a final dataset that can be properly analysed. Pandas merges various datasets efficiently so that we don't face problems while analysing the data (see the sketch below).
Optimised performance Pandas has highly optimised performance, which makes it fast and well-suited to data work. The critical code for Pandas is written in C or Cython, which makes it extremely responsive.
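As an illustration of the grouping and merging features mentioned above, here is a sketch with two tiny made-up tables:

```python
# GroupBy (split-apply-combine) and merging two datasets on a common column.
import pandas as pd

sales = pd.DataFrame({"store": ["A", "A", "B"], "units": [5, 7, 3]})
stores = pd.DataFrame({"store": ["A", "B"], "city": ["Delhi", "Pune"]})

print(sales.groupby("store")["units"].sum())   # split by store, sum, combine
print(pd.merge(sales, stores, on="store"))     # join the datasets on 'store'
```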
Matplotlib
Matplotlib is a Python package used for data visualization. It is a cross-platform library for making 2D plots from data in arrays. It provides an object-oriented API that helps in embedding plots in applications using Python GUI toolkits such as PyQt, WxPython, or Tkinter.
Matplotlib is an open-source library that offers various data visualizations (like line plots, histograms, scatter plots, bar charts, pie charts, and area plots). Matplotlib scripts are concise, meaning that a few lines of code are all that is required in most instances to generate a visual data plot. Matplotlib makes simple tasks simple and enables complex tasks to be accomplished.
Advantages of Matplotlib
Matplotlib offers several advantages that make it a popular choice for data visualization in Python:
- Matplotlib can create a wide range of plots for different types of data.
- You have the freedom to tweak every aspect of your plots, from colours to labels, to suit your needs.
- Plots made with Matplotlib are polished and suitable for professional use, such as in reports or presentations.
- It seamlessly works with other popular Python libraries for data analysis, enhancing your visualization capabilities.
- You can save your plots in various formats like images or PDFs, making them suitable for different uses.
- Matplotlib works reliably on different operating systems, ensuring consistency across platforms.
- With a large community of users and contributors, Matplotlib receives ongoing support, updates, and improvements, ensuring its reliability and relevance.
Basic Statistics in Python
In Python, you can calculate these statistical measures using libraries such as NumPy and statistics. These statistical measures are used to understand the distribution, central tendency, and variability of data, which in turn helps in making informed decisions and drawing meaningful insights from the dataset.
Mean The mean is the average value of a dataset. It is calculated by adding up all the values in the dataset and then dividing by the number of values. The mean provides a measure of central tendency, indicating the typical value in the dataset.
Median The median is the middle value of a dataset when it is sorted in ascending order. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values. The median is also a measure of central tendency and is less affected by extreme values (outliers) compared to the mean.
Mode The mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), two modes (bimodal), or more than two modes (multimodal). The mode provides insight into the most common value or category in the dataset.
Variance Variance measures the spread or dispersion of the values in a dataset. It quantifies how much the values in the dataset differ from the mean. A higher variance indicates that the values are more spread out from the mean, while a lower variance indicates that the values are closer to the mean.
Standard Deviation Standard deviation is another measure of the spread or dispersion of the values in a dataset. It is the square root of the variance. Like variance, standard deviation indicates how much the values deviate from the mean. A higher standard deviation suggests greater variability in the dataset, while a lower standard deviation suggests less variability.
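These five measures can be computed with Python's built-in statistics module; the dataset below is made up:

```python
# Mean, median, mode, variance and standard deviation of a small dataset.
import statistics

data = [4, 8, 6, 5, 3, 8, 9, 4, 8]

print(statistics.mean(data))      # average value (~6.11)
print(statistics.median(data))    # middle value of the sorted data (6)
print(statistics.mode(data))      # most frequent value (8)
print(statistics.variance(data))  # sample variance: spread around the mean
print(statistics.stdev(data))     # standard deviation: square root of the variance
```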
Data Visualization Class 10 Notes
During data collection, several types of problems or errors may occur, which can affect the quality and reliability of the dataset.
Some common types of data collection problems include:
- Measurement Errors These errors occur when there is inconsistency or inaccuracy in the measurement process. It could be due to human error, instrument calibration issues, or environmental factors affecting measurement accuracy.
- Response Errors Response errors occur when respondents provide inaccurate or misleading information. This can happen due to misunderstanding survey questions, social desirability bias, or deliberate falsification of responses.
- Missing Data Missing data occur when certain observations or values are not recorded or are incomplete. This can result from survey non-response, equipment malfunction, or data corruption during transmission or storage.
- Outliers They can occur due to measurement errors, natural variability, or extreme events. Outliers can distort statistical analyses and may need to be identified and addressed appropriately.
Graphs in Matplotlib
1. Line Plot
Line plots are drawn by joining straight lines connecting data points where x-axis and y-axis values intersect. In Matplotlib, the plot() function represents this type of graph.
2. Bar Plot
Bar plots are vertical/horizontal rectangular graphs that show data comparisons, where one axis represents categories and the other gauges changes over a period. In Matplotlib, we use the bar() or barh() functions to represent it.
3. Scatter Plot
We can implement the scatter plots while comparing various data variables to determine the connection between dependent and independent variables. The scatter() function is a tool for creating scatter plots in Matplotlib.
4. Pie Plot
A pie plot is a circular graph where the data is represented as components/segments, or slices of a pie. In Matplotlib, the pie() function represents it.
5. Area Plot
Area plots show quantities spread across an area, with bumps and drops (highs and lows); they are also known as stack plots. In Matplotlib, the stackplot() function represents it.
6. Histogram Plot
It is an estimate of the probability distribution of a continuous variable. The matplotlib hist() function plots a histogram.
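A short sketch showing three of these plot types on made-up data:

```python
# Line plot, bar plot and histogram with Matplotlib.
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]
sales = [120, 135, 128, 150, 160]

plt.plot(days, sales)     # line plot: straight lines join the data points
plt.xlabel("Day")
plt.ylabel("Sales")
plt.show()

plt.bar(days, sales)      # bar plot: one rectangle per day
plt.show()

plt.hist(sales, bins=3)   # histogram: distribution of the sales values
plt.show()
```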
Data Science : Classification Model Class 10 Notes
A data science classification model is a predictive algorithm designed to categorize or classify new instances into predefined classes or categories based on input features. It is a fundamental tool in supervised learning, where labelled data is used to train the model to recognize patterns and relationships between features and class labels.
Personality Prediction
Personality prediction is a fascinating area of research within psychology and data science that aims to understand and forecast an individual’s personality traits based on various data sources, such as social media posts, digital footprints, or self-reported surveys.
The goal of personality prediction is to gain insights into human behaviour, tailor personalized experiences, and inform decision-making processes in fields like marketing, recruitment, mental health, and personalized recommendation systems.
K-Nearest Neighbour Model
- K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised Learning technique.
- The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category most similar to the available categories.
- K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the Classification problems.
- K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying data.
- It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on it.
How does K-NN Work?
The K-NN working can be explained on the basis of the below algorithm:
- Step-1 Select the number K of the neighbors
- Step-2 Calculate the Euclidean distance of K number of neighbors
- Step-3 Take the K nearest neighbors as per the calculated Euclidean distance.
- Step-4 Among these K neighbors, count the number of the data points in each category.
- Step-5 Assign the new data points to that category for which the number of the neighbor is maximum.
- Step-6 Our model is ready.
In KNN, K is the number of nearest neighbours. The number of neighbours is the core deciding factor. K is generally an odd number if the number of classes is 2. When K=1, then the algorithm is known as the nearest neighbour algorithm. This is the simplest case.
Suppose P1 is the point for which the label needs to be predicted. The algorithm finds the K points closest to P1 and assigns it the majority label among them.
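A sketch of these steps using scikit-learn's KNeighborsClassifier; the points and labels are made up, and P1 is the new point to classify:

```python
# K-NN classification: store the data, measure distances, vote, assign.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]]  # known data points
y = [0, 0, 0, 1, 1, 1]                                # their categories

knn = KNeighborsClassifier(n_neighbors=3)   # Step 1: choose K = 3
knn.fit(X, y)                               # lazy learner: just stores the dataset

P1 = [[5, 5]]                               # new point whose label is predicted
print(knn.predict(P1))                      # Steps 2-5: distances, count, assign -> [1]
```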
Advantages of K-NN
- Simple to Understand Easy to grasp and implement, making it suitable for beginners.
- No Training Phase It doesn’t require a training phase since it memorizes the entire dataset.
- Non-Parametric It doesn’t make any assumptions about the underlying data distribution.
- Versatile It can be used for both classification and regression tasks.
- Handles Multi-Class Problems Naturally extends to multi-class classification problems.
- Robust to Outliers Outliers have less influence due to the majority voting scheme.
Disadvantages of K-NN
- Computationally Expensive Needs to compute distances to all data points, making it slow for large datasets.
- Memory Intensive Stores the entire training dataset, which can be memory-intensive for large datasets.
- Sensitive to Irrelevant Features Performance can degrade if irrelevant features are present.
- Requires Feature Scaling Performance may be affected by differences in feature scales.
- Limited Interpretability Provides little insight into the underlying relationship between features and target.
Glossary:
- Artificial Intelligence (AI) It utilizes data, employing techniques from data science to extract insights and patterns.
- Data Science It deals with structured data, which is typically organized in databases or tables with rows and columns.
- NLP It primarily works with unstructured textual data, which includes written or spoken human language.
- Computer Vision It deals with visual data, including images and videos.
- Data Scientist A person responsible for extracting insights and knowledge from large volumes of data using various statistical, mathematical, and computational techniques.
- NumPy It stands for Numerical Python and is a fundamental package for numerical computing in Python.
- Pandas It is a powerful and versatile Python library designed for data manipulation and analysis.
- Matplotlib It is a versatile Python library for creating static, interactive, and publication-quality visualizations.