Data Mining Interview Questions and Answers
by Sachin, on Jul 29, 2022 10:24:11 PM
Q1. What is Data Mining?
Data mining refers to extracting or mining knowledge from large amounts of data. In other words, data mining is the science, art, and technology of exploring large and complex bodies of data in order to discover useful patterns.
Q2. What are the different tasks of Data Mining?
- Association Rule Discovery
- Sequential Pattern Discovery
- Deviation Detection
Q3. What are the 5 data mining techniques?
- Classification analysis. This analysis is used to retrieve important and relevant information about data and metadata.
- Association rule learning.
- Anomaly or outlier detection.
- Clustering analysis.
- Regression analysis.
Q4. What are advantages of data mining?
Q5. What is classification in data mining?
Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks.
Q6. What are the 4 characteristics of data mining?
- Large quantities of data. The volume of data is so great that it has to be analyzed by automated techniques, e.g. satellite information, credit card transactions, etc.
- Noisy, incomplete data.
- Complex data structure.
- Heterogeneous data stored in legacy systems.
Q7. What is OLAP in data mining?
Q8. Explain the process of KDD?
Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD. Others view data mining as simply an essential step in the process of knowledge discovery, in which intelligent methods are applied in order to extract data patterns.
Q9. What are the different types of Data Mining?
- Data cleaning
- Pattern evaluation
- Data transformation
- Knowledge representation etc.
Q10. What are the different techniques used for Data Mining?
Prediction: This technique specifies the relationship between independent and dependent variables. For example, when considering sales data, if we want to predict the future profit, sales act as the independent variable, whereas profit is the dependent variable. Accordingly, based on the historical sales and profit data, the associated profit is the predicted value.
Decision trees: This technique uses a tree structure in which the root of the decision tree acts as a condition/question with multiple answers. Each answer leads to a further condition or to specific data that helps in determining the final decision.
Clustering analysis: This technique automatically forms clusters of objects that have similar characteristics. The clustering method defines the classes and then places suitable objects in each class.
Sequential Patterns: This technique is used to discover similar or recurring patterns in transaction data or regular events. For example, customers' historical data helps a brand identify the patterns in the transactions that happened in the past year.
Classification Analysis: This is a Machine Learning based method in which each item in a particular set is classified into predefined groups. It uses advanced techniques like linear programming, neural networks, decision trees, etc.
Association rule learning: This technique is used to create a pattern based on the items' relationship in a single transaction.
Q11. What is Prediction?
Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled object, or to estimate the value or value ranges of an attribute that a given object is likely to have. In this interpretation, classification and regression are the two major types of prediction problems, where classification is used to predict discrete or nominal values, while regression is used to predict continuous or ordered values.
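As a rough illustration of the regression side of prediction, the sketch below fits a straight line to made-up sales and profit figures with ordinary least squares and then predicts the profit for a new sales value. The data and the function names `fit_line` and `predict` are hypothetical, chosen only for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input value."""
    slope, intercept = model
    return slope * x + intercept


# Made-up historical data: sales (independent) and profit (dependent).
sales = [100, 200, 300, 400]
profit = [25, 45, 65, 85]

model = fit_line(sales, profit)
print("predicted profit for sales of 500:", predict(model, 500))
```

Predicting a class label instead of a number would be a classification problem; the regression case is shown here because the predicted value is continuous.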
Q12. What are the different stages used in "Data Mining"?
- Exploration: Exploration is the first stage of Data Mining. This stage involves the collection and preparation of different data sets, including cleaning, transformation, etc. Based on the types of data sets available, various tools are used to analyze the data.
- Model building and validation: In this stage, different models are applied to the data sets and compared to find the one with the best performance. This step is called pattern identification. It is a critical process because the user has to identify which pattern is best suited for easy predictions.
- Deployment: This is the last stage, where the best-chosen pattern is applied to the data sets. It is used to generate predictions, and it helps in estimating expected outcomes.
Q13. Explain Bayesian classification in Data Mining?
A Bayesian classifier is a statistical classifier. It can predict class membership probabilities, for instance, the probability that a given sample belongs to a particular class. Bayesian classification is based on Bayes' theorem. A simple Bayesian classifier, known as the naive Bayesian classifier, has been found to be comparable in performance with decision tree and neural network classifiers. Bayesian classifiers have also displayed high accuracy and speed when applied to large databases.
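A minimal sketch of a naive Bayesian classifier for categorical attributes is shown below. It assumes add-one (Laplace) smoothing, which the text above does not mention, and uses a tiny made-up loan data set; the function names `train_naive_bayes` and `class_probabilities` are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def class_probabilities(model, row):
    """Return P(class | row) for every class via Bayes' theorem,
    assuming attributes are independent given the class."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Add-one smoothing (an assumption of this sketch) avoids zero probabilities.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(values_seen[i])
            score *= numerator / denominator
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


# Made-up applicants: (income level, has previous default).
rows = [("high", "no"), ("high", "no"), ("low", "yes"),
        ("low", "no"), ("medium", "yes"), ("low", "yes")]
labels = ["low_risk", "low_risk", "high_risk",
          "low_risk", "high_risk", "high_risk"]

model = train_naive_bayes(rows, labels)
print(class_probabilities(model, ("low", "yes")))
```

The output is a probability per class rather than a single label, which matches the description of predicting class membership probabilities.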
Q14. What are Neural networks?
A neural network is a set of connected input/output units where each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so that it can predict the correct class label of the input samples. Neural network learning is also referred to as connectionist learning due to the connections between units. Neural networks involve long training times and are therefore more appropriate for applications where this is feasible. They require a number of parameters that are typically best determined empirically, such as the network topology or “structure”. Neural networks have been criticized for their poor interpretability, since it is difficult for humans to interpret the symbolic meaning behind the learned weights. These features initially made neural networks less desirable for data mining.
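The idea of learning by adjusting weights can be sketched with a single unit trained by the perceptron rule. This is a deliberate simplification (one unit, not a full multi-layer network), and the function names and the logical-AND training data are hypothetical.

```python
def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Train one threshold unit: nudge the weights whenever the
    predicted class label differs from the target label."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            activation = sum(w * x for w, x in zip(weights, inputs)) + bias
            output = 1 if activation > 0 else 0
            error = target - output  # 0 when the prediction is correct
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


def classify(weights, bias, inputs):
    """Predict the class label (0 or 1) for one input sample."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0


# Made-up training set: the logical AND of two binary inputs.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(samples)
print([classify(weights, bias, inputs) for inputs, _ in samples])
```

The learned weights are just numbers with no obvious symbolic meaning, which hints at the interpretability concern raised above.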
Q15. What is Clustering Algorithm in Data Mining?
In Data Mining, a clustering algorithm is used to group data with similar characteristics into sets (also known as clusters). Using these clusters, we can explore data and make decisions faster. First, the algorithm identifies the relationships in a dataset, and then it generates a series of clusters based on those relationships. The process of creating clusters is also iterative.
Q16. What do you understand by DMX in the context of Data Mining?
DMX is an acronym that stands for Data Mining Extensions. It is a query language for data mining models supported by Microsoft's SQL Server Analysis Services product. Like SQL, it supports a data definition language, a data manipulation language, and a data query language, all three with SQL-like syntax.
- Data Definition: This is used to define and create new models and structures.
- Data Manipulation: This is used to manipulate data based on the requirement.
Q17. What is a Genetic Algorithm?
A genetic algorithm is a part of evolutionary computing, which is a rapidly growing area of artificial intelligence. The genetic algorithm is inspired by Darwin’s theory of evolution: the solution to a problem solved by a genetic algorithm is evolved. In a genetic algorithm, a population of strings (called chromosomes or the genotype of the genome), which encode candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem, is evolved toward better solutions. Traditionally, solutions are represented as binary strings composed of 0s and 1s, but other encoding schemes can also be applied.
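The sketch below evolves a population of binary strings toward a toy objective (maximizing the number of 1 bits). The objective, the tournament selection, the single-point crossover, the elitism step, and all parameter values are assumptions made for illustration; real applications define their own fitness function and operators.

```python
import random


def fitness(chromosome):
    """Toy objective: count the 1 bits in the binary string."""
    return sum(chromosome)


def tournament(population, rng):
    """Pick two random individuals and return the fitter one."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def evolve(length=20, population_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Evolve binary chromosomes and return the best one found."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Elitism: carry the current best individual over unchanged.
        next_population = [max(population, key=fitness)]
        while len(next_population) < population_size:
            parent1 = tournament(population, rng)
            parent2 = tournament(population, rng)
            cut = rng.randint(1, length - 1)  # single-point crossover
            child = parent1[:cut] + parent2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            next_population.append(child)
        population = next_population
    return max(population, key=fitness)


best = evolve()
print(fitness(best), best)
```

Because the best individual is always kept, the best fitness in the population never decreases from one generation to the next.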
Q18. Name areas of applications of data mining?
- Data Mining Applications for Finance
- Crime Agencies
- Businesses Benefit from data mining
Q19. Explain Association Algorithm In Data Mining?
Association analysis is the discovery of association rules showing attribute-value conditions that frequently occur together in a given set of data. Association analysis is widely used for market basket or transaction data analysis. Association rule mining is a significant and exceptionally dynamic area of data mining research. One method of association-based classification, called associative classification, consists of two steps. In the first step, association rules are generated using a modified version of the standard association rule mining algorithm known as Apriori. The second step constructs a classifier based on the association rules discovered.
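To make the notion of items that frequently occur together concrete, the sketch below computes support and confidence on a few made-up market baskets and lists the frequent item pairs. It only mirrors the pruning idea behind Apriori (pairs are built from items that are themselves frequent) and is not the full algorithm; the helper names are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule A -> C."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_pairs(transactions, min_support):
    """Item pairs meeting min_support, built only from frequent single items."""
    items = sorted({item for t in transactions for item in t})
    frequent_items = [i for i in items
                      if support({i}, transactions) >= min_support]
    return [pair for pair in combinations(frequent_items, 2)
            if support(pair, transactions) >= min_support]


# Made-up market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(frequent_pairs(transactions, min_support=0.5))
print(confidence({"bread"}, {"milk"}, transactions))
```

A rule such as bread -> milk is usually kept only if both its support and its confidence exceed chosen thresholds.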
Q20. Define Tree Pruning?
When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data. Such methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data. The pruning phase eliminates some of the lower branches and nodes to improve the tree's performance, and the pruned tree is also easier to understand.
Q21. Define Chameleon Method?
Chameleon is a hierarchical clustering technique that uses dynamic modeling. It was introduced to overcome the drawbacks of the CURE clustering technique. In this technique, two clusters are merged if the interconnectivity between them is high relative to the interconnectivity between the objects inside each cluster.
Q22. What is the K-means algorithm?
The K-means clustering algorithm is one of the simplest unsupervised learning algorithms for solving clustering problems. It partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
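A minimal sketch of the usual iterative procedure (often called Lloyd's algorithm) is given below for 2-D points. The starting centroids are supplied by hand and the data is made up; practical implementations typically choose starting centroids randomly and stop when assignments no longer change.

```python
def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance to every centroid.
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            (sum(p[0] for p in cluster) / len(cluster),
             sum(p[1] for p in cluster) / len(cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Made-up observations forming two obvious groups, with k = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
start = [(1, 1), (8, 8)]

centroids, clusters = k_means(points, start)
print(centroids)
print(clusters)
```

Each returned centroid is the mean of its cluster, which is the "prototype" role described above.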
Q23. What is the simple difference between Principal Component Analysis (PCA) and Factor Analysis (FA)?
Among numerous differences, the significant difference between PCA and FA is that PCA aims to explain as much of the total variance in the variables as possible, whereas factor analysis aims to explain the covariance (shared variance) among the variables through underlying latent factors.
Q24. What is the difference between Data Mining and Data Profiling?
- Data Mining: Data Mining refers to the analysis of data to discover relations that have not been found before. It mainly focuses on the detection of unusual records, dependencies, and cluster analysis.
- Data Profiling: Data Profiling can be described as a process of analyzing individual attributes of data. It mostly focuses on providing valuable information about data attributes, for example data type, frequency, and so on.
Q25. What is the difference between univariate, bivariate, and multivariate analysis?
- Univariate: A descriptive statistical analysis that involves only one variable at a given instance of time. The three types of analysis are distinguished by the number of variables involved.
- Bivariate: This analysis is used to find the relationship between two variables at a time.
- Multivariate: The analysis of more than two variables at a time is known as multivariate. This analysis is used to understand the effect of the variables on the responses.
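The first two kinds of analysis can be sketched with the standard library: a univariate summary describes one variable on its own, while a bivariate measure such as the Pearson correlation relates two variables. The data and the helper names `summarize` and `pearson` are made up for illustration; multivariate analysis is omitted because it usually needs a numerical library.

```python
import statistics
from math import sqrt


def summarize(values):
    """Univariate analysis: describe a single variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def pearson(xs, ys):
    """Bivariate analysis: linear correlation between two variables."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up data: hours studied and exam scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

print(summarize(hours))
print(pearson(hours, scores))
```

A correlation close to 1 or -1 indicates a strong linear relationship between the two variables, while a value near 0 indicates little linear relationship.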
Q26. What are the required technological drivers in Data Mining?
In Data Mining, we have to deal with two main things: database size and query complexity.
- Database size: In Data Mining, we have to maintain and process a vast amount of data, so we must have a robust system with enough storage space.
- Query Complexity: To analyze complex queries and a large number of queries, we need a powerful system with enough RAM.