What is Data Mining?
Data mining: the process of discovering patterns in large data sets with the goal of extracting useful information from them.
This topic covers data mining and machine learning techniques, including Bayesian and neural networks, for diagnosis and prognosis applications in meteorology and climate.
Data mining is the process of extracting nontrivial and potentially useful information, or knowledge, from the enormous data sets available in the experimental sciences (historical records, reanalysis, GCM simulations, etc.). The result is explicit information in a readable form that can be used to solve diagnosis, classification or forecasting problems. Traditionally, these problems were solved by direct hands-on data analysis using standard statistical methods, but the increasing volume of data has motivated the study of automatic data analysis with more sophisticated tools that operate directly on the data. Thus, data mining identifies trends within data that go beyond simple analysis. Modern data mining techniques (association rules, decision trees, Gaussian mixture models, regression algorithms, neural networks, support vector machines, Bayesian networks, etc.) are used in many domains to solve association, classification, segmentation, diagnosis and prediction problems.
Among the different data mining algorithms, probabilistic graphical models (in particular Bayesian networks) are a sound and powerful methodology grounded in probability and statistics. They make it possible to build tractable joint probabilistic models that represent the relevant dependencies among a set of variables (hundreds of variables in real-life applications), and the resulting models allow for efficient probabilistic inference. For example, a Bayesian network could represent the probabilistic relationships between large-scale synoptic fields and local observation records, providing a new methodology for probabilistic downscaling: it makes it possible to compute P(observation | large-scale prediction). In the figure below, the red dots correspond to the grid nodes of a GCM, whereas the blue dots correspond to a network of stations with historical records (the links show the relevant dependencies, automatically discovered from data).
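To make the quantity P(observation | large-scale prediction) concrete, here is a minimal Python sketch of the smallest possible case: a two-node network in which one GCM grid node is the parent of one station. It is an illustration under stated assumptions, not the method used for the network described above. The records, the state names (wet_pattern, dry_pattern, rain, dry), the helper functions and the Laplace smoothing are all hypothetical, and the structure is fixed by hand rather than discovered from data.

```python
from collections import Counter

# Hypothetical toy records, invented for illustration only:
# (large-scale state at one GCM grid node, observation at one station).
records = [
    ("wet_pattern", "rain"),
    ("wet_pattern", "rain"),
    ("wet_pattern", "rain"),
    ("wet_pattern", "dry"),
    ("dry_pattern", "dry"),
    ("dry_pattern", "dry"),
    ("dry_pattern", "dry"),
    ("dry_pattern", "rain"),
]


def conditional_table(pairs, alpha=1.0):
    """Estimate P(observation | large_scale) from co-occurrence counts.

    alpha is a Laplace smoothing constant (an assumption of this sketch)
    that keeps unseen combinations from getting probability zero.
    """
    joint = Counter(pairs)
    parents = sorted({p for p, _ in pairs})
    children = sorted({c for _, c in pairs})
    table = {}
    for p in parents:
        total = sum(joint[(p, c)] for c in children) + alpha * len(children)
        table[p] = {c: (joint[(p, c)] + alpha) / total for c in children}
    return table


def downscale(table, forecast):
    """Combine a probabilistic large-scale forecast with the table.

    Computes P(obs) = sum over s of P(obs | s) * P(s), where P(s) is the
    forecast probability of large-scale state s.
    """
    local = {}
    for state, p_state in forecast.items():
        for obs, p_obs in table[state].items():
            local[obs] = local.get(obs, 0.0) + p_state * p_obs
    return local


cpt = conditional_table(records)
local = downscale(cpt, {"wet_pattern": 0.7, "dry_pattern": 0.3})
print(cpt)
print(local)
```

A real application would involve many grid nodes and stations, with each variable conditioned only on its parents in the graph; that factorization is what keeps the joint model tractable.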