Technology of operational analytical data processing. Ways of analytical data processing for decision support. The place of OLAP in the information structure of the enterprise

3.4 Methods of analytical data processing

For existing data warehouses to support management decision-making, information must be presented to the analyst in a suitable form, that is, the analyst must have well-developed tools for accessing and processing the data in the warehouse.

Information and analytical systems created for direct use by decision makers very often turn out to be extremely easy to use but severely limited in functionality. Such static systems are called Executive Information Systems (EIS). They contain predefined queries and, while sufficient for day-to-day review, cannot answer every question that may arise when a decision is being made. The output of such a system is typically a multi-page report, and after studying it carefully the analyst comes up with a new series of questions. However, every new query not foreseen when the system was designed must first be formally described, coded by a programmer, and only then executed. The waiting time can amount to hours or days, which is not always acceptable.

Online analytical processing (On-Line Analytical Processing, OLAP) is a key component of data warehousing. The concept of OLAP was described in 1993 by Edgar Codd and imposes the following requirements on applications for multidimensional analysis:

– multidimensional conceptual representation of data, including full support for hierarchies and multiple hierarchies (key OLAP requirement);

– providing the user with the results of the analysis in an acceptable time (usually no more than 5 s), even at the cost of a less detailed analysis;

– the ability to perform any logical and statistical analysis specific to a given application and save it in a form accessible to the end user;

– multi-user access to data with support for appropriate locking mechanisms and authorized access tools;

– the ability to access any necessary information, regardless of its volume and storage location.

An OLAP system consists of many components. At the highest presentation level, the system includes a data source, a multidimensional database (MDB) that provides the ability to implement an OLAP reporting engine, an OLAP server, and a client. The system is built on the client-server principle and provides remote and multi-user access to the MDB server.

Consider the components of an OLAP system.

Sources. The source in OLAP systems is the server that supplies the data for analysis. Depending on the area in which the OLAP product is used, the source can be a data warehouse, a legacy database containing common data, a set of tables combining financial data, or any combination of the above.

Data warehouse. The initial data is collected and placed in a warehouse designed in accordance with the principles of building data warehouses. The warehouse is a relational database (RDB). The main data table (the fact table) contains the numerical values of the indicators for which statistical information is collected.

Multidimensional database. The data warehouse serves as the information supplier for the multidimensional database, which is a collection of objects. The main classes of these objects are dimensions and measures. Dimensions are the sets of values (parameters) by which the data is indexed, for example time, region, type of institution, etc. Each dimension is filled with values from the corresponding dimension tables of the data warehouse. The set of dimensions defines the space of the process under study. Measures are multidimensional data cubes (hypercubes). A hypercube contains the data itself, as well as aggregate totals over the dimensions that make up the indicator. The indicators form the main content of the MDB and are populated from the fact table. Along each axis of the hypercube the data can be organized into a hierarchy representing different levels of detail. This makes it possible to create hierarchical dimensions along which the data will later be aggregated or refined during analysis. A typical example of a hierarchical dimension is a list of territorial entities grouped by district, region, and federal district.
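
A minimal sketch of these ideas in Python with pandas (a hypothetical fact table; all column names and the region/district hierarchy are made up for illustration): the dimensions become the axes of a pivot table, the numeric field becomes the measure, and margins supply the aggregate totals.

import pandas as pd

# Fact table as it might arrive from the data warehouse (illustrative data).
fact = pd.DataFrame({
    "month":    ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "region":   ["North", "South", "North", "North", "South"],
    "district": ["N-1",  "S-1",  "N-1",  "N-2",  "S-1"],
    "quantity": [10, 7, 4, 6, 9],
})

# Dimensions become index/column axes; the measure is the aggregated numeric field.
cube = pd.pivot_table(
    fact,
    values="quantity",
    index=["region", "district"],          # hierarchical dimension: region -> district
    columns="month",
    aggfunc="sum",
    fill_value=0,
    margins=True, margins_name="Total",    # aggregate sums along each axis
)
print(cube)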

Server. The application part of the OLAP system is the OLAP server. This component performs all the work (depending on the system model) and stores all the information that is actively accessed. Different server architecture concepts exist; in particular, the main functional characteristic of OLAP products is whether an MDB or an RDB is used for data storage.

Client application. Data structured appropriately and stored in the MDB are available for analysis through the client application. The user gains remote access to the data and can formulate complex queries, generate reports, and obtain arbitrary subsets of the data. Obtaining a report comes down to choosing specific dimension values and building a section (slice) of the hypercube. The cross-section is determined by the selected dimension values; the data for the remaining dimensions are summarized.
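
A minimal sketch of such a cross-section (hypothetical data and column names): one dimension value is fixed, and the measure is summarized over the dimensions the user did not select.

import pandas as pd

fact = pd.DataFrame({
    "month":    ["Jan", "Jan", "Feb", "Feb"],
    "region":   ["North", "South", "North", "South"],
    "product":  ["A", "A", "B", "A"],
    "quantity": [10, 7, 4, 9],
})

# Slice: the region dimension is fixed to "North".
north = fact[fact["region"] == "North"]

# Summarize over the dimensions not selected by the user (here: product).
report = north.groupby("month")["quantity"].sum()
print(report)   # one number per month for the chosen region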

OLAP on the client and on the server. Multidimensional data analysis can be carried out with various tools, which can be loosely divided into client-side and server-side OLAP tools.

Client-side OLAP tools (such as Microsoft's Pivot Tables in Excel 2000 or Knosys' ProClarity) are applications that compute aggregate data and display it. At the same time, the aggregate data itself is contained in the cache inside the address space of such an OLAP tool.

If the source data is contained in a desktop DBMS, the calculation of aggregate data is performed by the OLAP tool itself. If the source of the initial data is a server DBMS, many of the client OLAP tools send SQL queries to the server and, as a result, receive aggregate data calculated on the server.

As a rule, OLAP functionality is implemented in statistical data processing tools and in some spreadsheets.

Many development tools contain class or component libraries that allow you to create applications that implement the simplest OLAP functionality (such as the Decision Cube components in Borland Delphi and Borland C++ Builder). In addition, many companies offer ActiveX controls and other libraries that provide similar functionality.

Client OLAP tools are used, as a rule, with a small number of dimensions (usually no more than six) and a small variety of values of those dimensions, because the resulting aggregate data must fit in the address space of such a tool, and its volume grows exponentially as the number of dimensions increases.

Many client OLAP tools allow the contents of the aggregate-data cache to be saved as a file so that it does not have to be recomputed. This capability is also often used to hand aggregate data over to other organizations or to publish it.

The idea of storing a cache of aggregate data in a file has been developed further in server-side OLAP tools (for example, Oracle Express Server or Microsoft OLAP Services), in which saving and modifying aggregate data, as well as maintaining the storage that contains it, is carried out by a separate application or process called the OLAP server. Client applications can query such multidimensional storage and receive data in response. Some client applications can also create such stores or update them as the source data changes.

The advantages of server-side OLAP tools over client-side OLAP tools are similar to the advantages of server DBMSs over desktop ones: when server tools are used, the calculation and storage of aggregate data take place on the server, and the client application receives only the results of queries against them, which generally reduces network traffic, query execution time, and the resources consumed by the client application.

3.5 Technical aspects of multidimensional data storage

Multidimensionality in OLAP applications can be divided into three levels:

1. Multidimensional Data View – end-user tools that provide multidimensional visualization and manipulation of data; the multidimensional representation layer abstracts from the physical structure of the data and treats the data as multidimensional.

2. Multidimensional Processing – a means (language) for formulating multidimensional queries (the traditional relational SQL language is unsuitable here) and a processor that can process and execute such queries.

3. Multidimensional Storage – means of physically organizing the data that ensure the efficient execution of multidimensional queries.

The first two levels are mandatory in all OLAP tools. The third level, although widely used, is not required, since data for multidimensional representation can also be retrieved from ordinary relational structures. The multidimensional query processor, in this case, translates multidimensional queries into SQL queries that are executed by a relational DBMS.

In any data warehouse, both conventional and multidimensional, aggregated indicators (totals) are stored along with the detailed data retrieved from operational systems, such as sales totals by month, by product category, and so on. Aggregates are stored explicitly for the sole purpose of speeding up query execution. On the one hand, a very large amount of data is usually accumulated in the warehouse; on the other hand, analysts are in most cases interested not in detailed but in generalized indicators. If millions of individual sales had to be summed up every time annual sales are calculated, the speed would most likely be unacceptable. Therefore, when data is loaded into a multidimensional database, all or some of the total indicators are calculated and saved.
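
A minimal sketch of pre-computed aggregates (made-up data and names): the totals are calculated once at load time and an analyst's query is then answered by a lookup rather than by re-summing the detailed records.

import pandas as pd

sales = pd.DataFrame({
    "year":     [2004, 2004, 2004, 2005],
    "month":    [1, 1, 2, 1],
    "category": ["chips", "cola", "chips", "cola"],
    "amount":   [100.0, 80.0, 120.0, 95.0],
})

# Aggregates computed when the cube is loaded and kept alongside the detail.
aggregates = {
    ("year",):            sales.groupby(["year"])["amount"].sum(),
    ("year", "category"): sales.groupby(["year", "category"])["amount"].sum(),
    ("year", "month"):    sales.groupby(["year", "month"])["amount"].sum(),
}

# "Total sales for 2004" is now a lookup, not a scan of the detailed records.
print(aggregates[("year",)].loc[2004])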

However, the use of aggregated data has its drawbacks. The main ones are the growth in the amount of stored information (as new dimensions are added, the amount of data that makes up the cube grows exponentially) and the time needed to load it. The volume of information can grow by a factor of tens or even hundreds. For example, in one published standard test, full aggregation of 10 MB of raw data required 2.4 GB, i.e. the data grew 240-fold.
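
A rough arithmetic sketch of this growth (the member counts below are made up): a fully aggregated cube holds one cell for every combination of members, including the "all members" total of each dimension, i.e. (|D1|+1) * (|D2|+1) * ... cells.

from math import prod

dims = {"month": 12, "product": 100, "supplier": 50}
cells = prod(n + 1 for n in dims.values())
print(cells)                                 # 66,963 potential cells for 3 dimensions

dims["region"] = 20                          # one extra 20-member dimension ...
print(prod(n + 1 for n in dims.values()))    # ... roughly 21 times more cells (1,406,223)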

How much the amount of data grows when aggregates are calculated depends on the number of cube dimensions and the structure of those dimensions, i.e. the ratio of the number of "parents" and "children" at the different levels of a dimension. To address the problem of storing aggregates, sophisticated schemes are used that compute only a fraction of all possible aggregates yet still achieve a significant increase in query performance.

Both raw and aggregate data can be stored either in relational or in multidimensional structures. Accordingly, three methods of storing multidimensional data are currently in use:

MOLAP (Multidimensional OLAP) - source and aggregate data is stored in a multidimensional database. Storing data in multidimensional structures allows you to manipulate data as a multidimensional array, so that the speed of calculating aggregate values ​​is the same for any of the dimensions. However, in this case, the multidimensional database is redundant, since the multidimensional data completely contains the original relational data.

These systems provide a full cycle of OLAP processing. They either include, in addition to the server component, their own integrated client interface, or use external spreadsheet programs to communicate with the user.

ROLAP (Relational OLAP) - the original data remains in the same relational database where it was originally located. Aggregate data is placed in service tables specially created for their storage in the same database.

HOLAP (Hybrid OLAP) - The original data remains in the same relational database where it originally resided, while the aggregate data is stored in a multidimensional database.

Some OLAP tools support data storage only in relational structures, others only in multidimensional structures. However, most modern OLAP server tools support all three data storage methods. The choice of storage method depends on the volume and structure of the source data, the requirements for the speed of query execution, and the frequency of updating OLAP cubes.

3.6 Data mining (Data Mining)

The term Data Mining refers to the process of searching for correlations, trends and relationships by means of various mathematical and statistical algorithms (clustering, regression and correlation analysis, etc.) for decision support. In this process, the accumulated raw information is automatically generalized into information that can be characterized as knowledge.

The basis of modern Data Mining technology is the concept of patterns that reflect the patterns inherent in subsamples of data and constituting the so-called hidden knowledge.

Patterns are searched for by methods that make no a priori assumptions about these subsamples. An important feature of Data Mining is that the patterns being sought are non-standard and non-obvious. In other words, Data Mining tools differ from statistical processing tools and OLAP tools in that, instead of verifying relationships assumed by users in advance, they are able to find such relationships independently on the basis of the available data and to build hypotheses about their nature.

In general, the process of data mining (Data Mining) consists of three stages:

1) identification of patterns (free search);

2) use of the revealed patterns to predict unknown values (predictive modeling);

3) exception analysis, intended to identify and interpret anomalies in the patterns found.

Sometimes, an intermediate stage of checking the reliability of the patterns found between their discovery and use (validation stage) is explicitly singled out.

There are five standard types of patterns identified by Data Mining methods:

1. Association makes it possible to identify stable groups of objects between which implicitly defined links exist. The frequency of occurrence of an individual item or group of items, expressed as a percentage, is called its prevalence (support). A low prevalence (less than one thousandth of a percent) suggests that such an association is not significant. Associations are written as rules A => B, where A is the premise and B is the consequence. To determine the importance of each resulting association rule, a value called the confidence of A with respect to B (the strength of the link between A and B) is calculated. Confidence shows how often, when A appears, B also appears. For example, if the confidence of a rule is 20%, this means that when product A is bought, product B is also purchased in every fifth case. A small computational sketch follows this list of pattern types.

A typical example of the application of the association is the analysis of the structure of purchases. For example, when conducting a study in a supermarket, you can find that 65% of those who bought potato chips also take Coca-Cola, and if there is a discount for such a set, they buy cola in 85% of cases. Such results are valuable in the formation of marketing strategies.

2. Sequence (sequential patterns) is a method of identifying associations over time. In this case, rules are defined that describe the sequential occurrence of certain groups of events. Such rules are needed for building scenarios. In addition, they can be used, for example, to form a typical set of prior sales that may lead to subsequent sales of a particular product.

3. Classification is a generalization tool. It allows one to move from considering individual objects to generalized concepts that characterize certain collections of objects and are sufficient for recognizing objects belonging to these collections (classes). The essence of concept formation is to find the patterns inherent in the classes. Many different features (attributes) are used to describe the objects. The problem of concept formation from attribute descriptions was formulated by M. M. Bongard. Its solution is based on two main procedures: training and verification. In the training procedure, a classifying rule is built by processing a training set of objects. The verification (examination) procedure consists of using the obtained classifying rule to recognize objects from a new (examination) sample. If the test results are satisfactory, the learning process ends; otherwise the classifying rule is refined through repeated training.

4. Clustering - this is the distribution of information (records) from the database into groups (clusters) or segments with the simultaneous definition of these groups. In contrast to classification, the analysis here does not require pre-specifying classes.

5. Time series forecasting is a tool for determining trends in the attributes of the objects under consideration over time. Analysis of the behavior of time series allows you to predict the values ​​of the studied characteristics.
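
A minimal sketch of computing prevalence (support) and confidence for the association rule "chips => cola"; the transaction list below is made up and does not reproduce the percentages quoted in the supermarket example above.

transactions = [
    {"chips", "cola"}, {"chips"}, {"chips", "cola", "beer"},
    {"cola"}, {"chips", "cola"}, {"bread"},
]

def support(itemset):
    """Share of transactions containing every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """How often the consequent appears when the antecedent is present."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"chips", "cola"}))        # prevalence of the pair: 0.5
print(confidence({"chips"}, {"cola"}))   # confidence of chips => cola: 0.75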

To solve such problems, various methods and algorithms of Data Mining are used. In view of the fact that Data Mining has developed and is developing at the intersection of such disciplines as statistics, information theory, machine learning, database theory, it is quite natural that most of the algorithms and methods of Data Mining have been developed based on various methods from these disciplines.

From the variety of existing data mining methods, the following can be distinguished:

– regression, variance and correlation analysis (implemented in most modern statistical packages, in particular in the products of SAS Institute, StatSoft, etc.);

– methods of analysis in a specific subject area based on empirical models (often used, for example, in inexpensive financial analysis tools);

– neural network algorithms – a method of imitating processes and phenomena that makes it possible to reproduce complex dependencies. The method is based on a simplified model of the biological brain: the initial parameters are treated as signals that are transformed according to the existing connections between "neurons", and the response of the entire network to the input data is treated as the answer resulting from the analysis. The connections are created by so-called network training on a large sample containing both the input data and the correct answers. Neural networks are widely used to solve classification problems;

– fuzzy logic is used to process data with fuzzy truth values that can be represented by various linguistic variables. Fuzzy knowledge representation is widely used to solve classification and prediction problems, for example in the XpertRule Miner system (Attar Software Ltd., UK), as well as in AIS, NeuFuz, etc.;

– inductive inference makes it possible to obtain generalizations of the facts stored in the database. A specialist supplying hypotheses can take part in the inductive learning process; this is called supervised learning. The search for generalization rules can also be carried out without a teacher, by automatic generation of hypotheses. Modern software tools usually combine both approaches and use statistical methods to test the hypotheses. An example of a system using inductive inference is XpertRule Miner, developed by Attar Software Ltd. (UK);

– reasoning from similar cases (the "nearest neighbour" method, Case-Based Reasoning, CBR) is based on searching the database for situations whose descriptions are similar in a number of attributes to the given situation. The principle of analogy suggests that the outcomes of similar situations will also be close to each other. A drawback of this approach is that it creates no models or rules generalizing previous experience. In addition, the reliability of the inferred results depends on the completeness of the description of the situations, as in inductive inference. Examples of systems using CBR are KATE Tools (Acknosoft, France) and Pattern Recognition Workbench (Unica, USA);

– decision trees – a method of structuring the problem as a tree graph whose vertices correspond to production rules that allow the data to be classified or the consequences of decisions to be analyzed (a minimal classification sketch is given after this list). This method gives a clear picture of the system of classifying rules, provided there are not too many of them. Simple problems are solved with this method much faster than with neural networks. For complex problems and for some types of data, decision trees may not be appropriate. In addition, this method suffers from the significance problem: one consequence of hierarchical partitioning of the data is the lack of a large number of training examples for many special cases, so the classification cannot be considered reliable. Decision tree methods are implemented in many software tools, for example C5.0 (RuleQuest, Australia), Clementine (Integral Solutions, UK), SIPINA (University of Lyon, France), IDIS (Information Discovery, USA);

– evolutionary programming – the search for and generation of an algorithm expressing the interdependence of the data, starting from an initially specified algorithm that is modified during the search; sometimes the interdependencies are sought among certain classes of functions (for example, polynomials);

– bounded-search algorithms, which compute combinations of simple logical events in subgroups of the data.
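
A minimal classification sketch, using scikit-learn as one possible tool (not one of the packages named above); the loan data, attributes and class labels are made up. It follows the training/examination procedure described for classification: a rule is built on a training set and then checked on an examination set.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Attributes: [age, income, number of late payments]; class: 1 = repaid on time.
X = [[25, 300, 2], [40, 900, 0], [35, 500, 1], [50, 1200, 0],
     [22, 250, 3], [45, 800, 0], [30, 400, 2], [60, 1500, 0]]
y = [0, 1, 0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(max_depth=2)   # a small, readable tree of rules
model.fit(X_train, y_train)                   # training step
print(model.score(X_test, y_test))            # examination step: share of correct answers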

3.7 Integration of OLAP and Data Mining

Online analytical processing (OLAP) and data mining (Data Mining) are two components of the decision support process. However, today most OLAP systems only focus on providing access to multidimensional data, and most data mining tools that work in the realm of patterns deal with one-dimensional data perspectives. To increase the efficiency of data processing for decision support systems, these two types of analysis should be combined.

The compound term "OLAP Data Mining" (multidimensional data mining) is now emerging to refer to such a combination.

There are three main ways to form "OLAP Data Mining":

1. Cubing then mining. It should be possible to perform data mining over any result of a query against the multidimensional conceptual representation, that is, over any fragment of any projection of the hypercube of indicators.

2. Mining then cubing. Like data extracted from the warehouse, the results of mining must be presented in hypercube form for subsequent multidimensional analysis.

3. Cubing while mining. This flexible integration method makes it possible to automatically trigger the same kind of mining mechanisms over the result of each step of the multidimensional analysis (a transition between levels of generalization, the extraction of a new hypercube fragment, etc.).
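
A minimal "cubing then mining" sketch (hypothetical data): first an OLAP-style aggregation builds a fragment of the cube, then a simple mining step (z-score outlier detection) is run over the resulting cells.

import pandas as pd

fact = pd.DataFrame({
    "region": ["North", "North", "South", "South", "East", "East"],
    "month":  ["Jan", "Feb", "Jan", "Feb", "Jan", "Feb"],
    "sales":  [100, 110, 95, 105, 98, 480],   # one suspicious value
})

# Step 1: "cubing" - a projection of the hypercube (sales by region and month).
cells = fact.groupby(["region", "month"])["sales"].sum()

# Step 2: "mining" - flag cells more than 1.5 standard deviations from the mean.
z = (cells - cells.mean()) / cells.std()
print(cells[z.abs() > 1.5])    # the anomalous East/Feb cell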


Analytical technologies of business processes

Business intelligence systems - Business Intelligence (BI) combine various tools and technologies for analyzing and processing enterprise-wide data. Based on these tools, BI systems are created, the purpose of which is to improve the quality of information for making managerial decisions.

BI includes software products of the following classes:

· systems of operational analytical processing (OLAP);

· data mining (DM) tools;

Software products of each class perform a certain set of functions or operations using special technologies.

OLAP (On-Line Analytical Processing) - online analytical processing - is not the name of a specific product, but of a whole technology. The concept of OLAP is based on a multidimensional representation of data.

In 1993 Edgar Codd, mathematician, IBM fellow and founder of the relational approach to databases, together with partners published a paper, commissioned by Arbor Software, entitled "Providing OLAP (On-Line Analytical Processing) to User-Analysts", which formulated 12 criteria of OLAP technology that later became the main content of this new and very promising technology.

Later they were reworked into the FASMI test, which defines the requirements for OLAP products:

· FAST (fast). An OLAP application should provide a minimum access time to analytical data - on average, about 5 seconds;

· ANALYSIS (analysis). An OLAP application must allow the user to perform numerical and statistical analysis;

· SHARED (shared access). An OLAP application must provide the ability to work with information for many users at the same time;

· MULTIDIMENSIONAL (multidimensional). An OLAP application must provide a multidimensional conceptual view of the data;

· INFORMATION (information). An OLAP application should enable the user to obtain the necessary information, no matter in which electronic data store it is located.

Based on FASMI, the following definition can be given: OLAP applications are systems for fast multi-user access to multidimensional analytical information with numerical and statistical analysis capabilities.

The main idea of ​​OLAP is to build multidimensional cubes that will be available for user queries. Multidimensional cubes (Figure 5.3) are built on the basis of source and aggregated data that can be stored in both relational and multidimensional databases. Therefore, there are currently three ways to store data: MOLAP (Multidimensional OLAP), ROLAP (Relational OLAP) and HOLAP (Hybrid OLAP).

Accordingly, OLAP products are divided into three similar categories according to the method of data storage:

1. In the case of MOLAP, the source and aggregate data are stored in a multidimensional database or in a local multidimensional cube. This storage method provides high speed of OLAP operations, but the multidimensional database will most often be redundant. The cube built on it depends strongly on the number of dimensions: as the number of dimensions increases, the volume of the cube grows exponentially, which can sometimes lead to an "explosive growth" in the amount of data.

2. In ROLAP products, source data is stored in relational databases or in flat local tables on a file server. Aggregate data can be placed in service tables in the same database. The transformation of data from a relational database into multidimensional cubes occurs at the request of an OLAP tool. In this case, the speed of building a cube will greatly depend on the type of data source.

3. In the case of using a hybrid architecture, the source data remains in the relational database, while the aggregates are placed in the multidimensional one. An OLAP cube is built at the request of an OLAP tool based on relational and multidimensional data. This approach avoids explosive data growth. In this case, you can achieve the optimal execution time for client requests.

Using OLAP technologies, the user can flexibly view information, obtain various data slices, and perform the analytical operations of drill-down (detailing), roll-up (consolidation), drill-across and comparison over time, i.e. produce and dynamically publish reports and documents.

The structure of the warehouse database is usually designed to make the analysis of information as easy as possible. The data should be conveniently "laid out" in different directions (called dimensions). For example, today a user wants to see a summary of parts shipments by supplier in order to compare their performance. Tomorrow the same user will need a picture of how delivery volumes change by month in order to track the dynamics of deliveries. The structure of the database should support these types of analysis, allowing the extraction of data corresponding to a given set of dimensions.

Operational analytical data processing is based on the principle of organizing information into a hypercubic model. The simplest three-dimensional data cube for the supply of parts in the previously considered test database is shown in Fig. 3.11. Each of its cells corresponds to a "fact", for example the volume of delivery of a part. Along one side of the cube (one dimension) are the months during which the deliveries reflected by the cube were made. The second dimension is part types, and the third dimension corresponds to suppliers. Each cell contains the delivery quantity for the corresponding combination of values across all three dimensions. Note that when the cube is filled, the delivery values for each month from the test database are aggregated.


Fig. 3.11. A simplified hypercube for parts supply analysis

OLAP class systems differ in the way data is presented.

Multidimensional OLAP (MOLAP) – these systems are based on a multidimensional data structure built on dynamic arrays with appropriate access methods. MOLAP is implemented using proprietary technologies for organizing multidimensional DBMSs. The advantage of this approach is the convenience of performing calculations on hypercube cells, since cells are created for all combinations of dimensions (as in a spreadsheet). Classical representatives of such systems include Oracle Express and SAS Institute MDDB.

Relational OLAP (ROLAP) – supports multidimensional analytical models over relational databases. This class of systems includes Informix MetaCube, Microsoft OLAP Services, Hyperion Solutions, and SAS Institute Relational OLAP.

Desktop OLAP – tools for generating multidimensional queries and reports over local information sources (spreadsheets, flat files). Examples include Business Objects and Cognos PowerPlay.

E.F. Codd defined twelve rules that an OLAP-class product must satisfy, including multidimensional conceptual representation of data, transparency, accessibility, robust performance, client-server architecture, dimension equality, dynamic processing of sparse matrices, multi-user support, unlimited support for cross-dimensional operations, intuitive data manipulation, a flexible reporting mechanism, and an unlimited number of dimensions and aggregation levels.



The most widely used systems are of the ROLAP class. They make it possible to organize an information model over a relationally complete warehouse of any structure or over a special data mart.

Fig. 3.12. Star schema of the parts supply analytical data mart

For most data warehouses, the most efficient way to model an N-dimensional cube is a "star" schema. Fig. 3.12 shows such a model for parts supply analysis, in which information is consolidated across four dimensions (supplier, part, month, year). The star schema is built around a fact table. The fact table contains a column holding the delivery volume, as well as columns with foreign keys to all the dimension tables. Each dimension of the cube is represented by a table of values referenced by the fact table. To organize levels of generalization above the dimension reference tables, categorical inputs are created (for example, "material – part", "city – supplier").

The reason the schema in Fig. 3.12 is called a "star" is quite obvious: the points of the star are formed by the dimension tables, and their links to the fact table located in the center form the rays. With this database design, most business analysis queries join the central fact table with one or more dimension tables. For example, a query that returns the 2004 shipments of all parts by month, broken down by supplier, looks like this:

SELECT SUM(FACT.VALUE), SUPPLIER.SUPPLIER_NAME, FACT.MONTH_ID

FROM FACT, SUPPLIER

WHERE FACT.YEAR_ID = 2004

AND FACT.SUPPLIER_CODE = SUPPLIER.SUPPLIER_CODE

GROUP BY SUPPLIER.SUPPLIER_NAME, FACT.MONTH_ID

ORDER BY SUPPLIER.SUPPLIER_NAME, FACT.MONTH_ID;

Fig. 3.13 shows a fragment of the report generated by this query.

The term online analytical processing (On-Line Analytical Processing, OLAP) was first used in a report prepared for Arbor Software Corp. in 1993, although a definition of the term, as with data warehouses, was formulated much later. The concept denoted by this term can be defined as "an interactive process of creating, maintaining and analyzing data and producing reports". It is usually added that the data in question should be perceived and processed as if it were stored in a multidimensional array. But before turning to the multidimensional representation itself, let us look at the relevant ideas in terms of traditional SQL tables.

The first point is that analytical processing necessarily requires some aggregation of the data, usually performed in several different ways at once, that is, according to many different grouping criteria. In fact, one of the main problems of analytical processing is that the number of possible ways of grouping very quickly becomes too large, and yet users need to consider all or almost all of them. Of course, this kind of aggregation is now supported in the SQL standard, but any given SQL query produces only one table as its result, and all rows of that result table have the same form and the same interpretation¹⁰ (at least that was the case before the SQL:1999 standard).

9 To quote from a book on data warehousing: "[Stop] normalizing... Trying to normalize any of the tables in a multidimensional database just to save disk space [sic!] is a waste of time... Dimension tables must not be normalized... Normalized dimension tables preclude browsing."

10 Unless that result table includes any nulls (see Chapter 19, Section 19.3, the subsection "More on Predicates"). In fact, the SQL:1999 constructs described in this section can be characterized as being based on this highly deprecated SQL feature; in effect, they exploit the fact that nulls can mean different things in their various manifestations, and therefore allow many different predicates to be represented in the same table (as will be shown below).

Therefore, in order to obtain n different groupings, n separate queries must be executed, producing n separate result tables. For example, consider the following sequence of queries against the suppliers-and-parts database.

1. Determine the total number of deliveries.

2. Determine the total number of deliveries by suppliers.

3. Determine the total number of deliveries by parts.

4. Determine the total number of deliveries by suppliers and parts.

(Of course, the "total" quantity for a given supplier and a given part is simply the actual quantity for that supplier and part. The example would be more realistic with a database of suppliers, parts, and projects, but to keep things simple we stick with the usual suppliers-and-parts database.)

Now suppose there are only two parts, numbered P1 and P2, and the supply table looks like this.
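
The supply table itself is not reproduced here; the following minimal sketch uses made-up quantities to illustrate the point above, producing the four groupings one by one, just as n separate queries were needed before the SQL:1999 grouping extensions.

import pandas as pd

sp = pd.DataFrame({
    "supplier": ["S1", "S1", "S2", "S2"],
    "part":     ["P1", "P2", "P1", "P2"],
    "qty":      [300, 200, 400, 100],
})

groupings = [[], ["supplier"], ["part"], ["supplier", "part"]]
for cols in groupings:                 # n groupings -> n separate "queries"
    if cols:
        print(sp.groupby(cols)["qty"].sum(), "\n")
    else:
        print("grand total:", sp["qty"].sum(), "\n")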

Multidimensional databases

So far it has been assumed that the OLAP data is stored in an ordinary SQL database (apart from the fact that we occasionally touched on the terminology and concepts of multidimensional databases). In effect, without saying so explicitly, we have been describing a so-called ROLAP (Relational OLAP) system. However, many believe that the use of a MOLAP (Multidimensional OLAP) system is a more promising approach. In this subsection the principles of building MOLAP systems are discussed in more detail.

The MOLAP system maintains multidimensional databases, in which data is conceptually stored in the cells of a multidimensional array.

Note. Although the above speaks of a conceptual way of organizing storage, in MOLAP the physical organization of the data is in fact very similar to its logical organization.

The supporting DBMS is called multidimensional. A simple example is a three-dimensional array representing products, customers and time periods, respectively. The value of an individual cell might be the total quantity of the given product sold to the given customer in the given time period. As noted above, the cross tables of the previous subsection can also be regarded as such arrays.

If the structure of the data set is understood sufficiently well, then all the relationships between the data can be known. Moreover, the variables of such a collection (not in the sense of conventional programming languages) can, roughly speaking, be divided into dependent and independent. In the previous example, product, customer and time period can be considered independent variables, and quantity the only dependent variable. In general, independent variables are variables whose values together determine the values of the dependent variables (just as, in relational terminology, a candidate key is a set of columns whose values determine the values of the other columns). The independent variables therefore define the dimensions of the array by which the data is organized and form the addressing scheme¹¹ for that array. The values of the dependent variables, which represent the actual data, are stored in the cells of the array.

Note. The difference between the values of the independent, or dimensional, variables and the values of the dependent, or non-dimensional, variables is sometimes characterized as the difference between location and content.

" Therefore, array cells are addressed symbolically, rather than using numerical indexes, which are usually used to work with arrays.

Unfortunately, the above characterization of multidimensional databases is too simplistic, because most data sets are initially not fully understood. For this reason we usually seek to analyze the data in the first place precisely in order to understand it better. Often the lack of understanding is so significant that it is impossible to determine in advance which variables are independent and which are dependent. The independent variables are then chosen according to the current view of them (i.e. on the basis of some hypothesis), after which the resulting array is examined to see how well the independent variables were chosen (see Section 22.7). This approach leads to many trial-and-error iterations. The system therefore usually allows dimensional and non-dimensional variables to be swapped, an operation called pivoting (changing the coordinate axes). Other supported operations include transposing the array and reordering its dimensions. There must also be a way to add dimensions.

Incidentally, it should be clear from the foregoing that array cells often turn out to be empty (and the more dimensions there are, the more often this happens); in other words, the arrays are usually sparse. Suppose, for example, that product p was not sold to customer c at all during time period t. Then the cell [c, p, t] is empty (or, at best, contains null). Multidimensional DBMSs support various methods of storing sparse arrays in a more efficient, compact form¹². It should be added that empty cells correspond to missing information, and so the systems must provide some computational support for empty cells. Such support is indeed usually available, but unfortunately in a style similar to that of SQL: if a cell is empty, the information may be unknown, not entered, not applicable, or missing for some other reason (see Chapter 19).
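
A minimal sketch of sparse multidimensional storage (hypothetical data): instead of a dense customer x product x month array full of empty cells, only the cells that actually contain a fact are kept, addressed symbolically.

cube = {
    ("c1", "p1", "2004-01"): 10,
    ("c1", "p2", "2004-02"): 4,
    ("c2", "p1", "2004-02"): 7,
}

# A dense array for 1,000 customers x 500 products x 36 months would need
# 18,000,000 cells; the sparse dictionary stores only the 3 non-empty ones.
qty = cube.get(("c2", "p2", "2004-01"))   # empty cell -> None ("missing information")
print(qty, len(cube))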

The independent variables are often related in hierarchies that define the paths along which the dependent data can be aggregated. For example, there is a temporal hierarchy linking seconds to minutes, minutes to hours, hours to days, days to weeks, weeks to months, and months to years. Another example is a composition hierarchy linking parts to part kits, part kits to assemblies, assemblies to modules, and modules to products. Often the same data can be aggregated in many different ways, i.e. the same independent variable may belong to many different hierarchies. The system provides operators for drilling up and drilling down such hierarchies: drilling up means moving from a lower level of aggregation to a higher one, and drilling down means moving in the opposite direction. There are other operations for working with hierarchies, for example an operation for reordering the levels of a hierarchy.
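
A minimal drill-up / drill-down sketch over a time hierarchy (day to month to year), using made-up sales records.

import pandas as pd

sales = pd.DataFrame({
    "date":   pd.to_datetime(["2004-01-10", "2004-01-25", "2004-02-05", "2005-01-03"]),
    "amount": [100, 150, 80, 200],
})
sales["year"] = sales["date"].dt.year
sales["month"] = sales["date"].dt.month

by_year = sales.groupby("year")["amount"].sum()              # upper level of the hierarchy
by_month = sales.groupby(["year", "month"])["amount"].sum()  # lower, more detailed level

print(by_year)     # drill up: yearly totals
print(by_month)    # drill down: monthly detail within each year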

Note. There is a subtle difference between the drill up operation and the roll up operation: roll up is the operation that performs the required grouping and aggregation, whereas drill up is the operation that accesses the results of that grouping and aggregation. An example of drill down is a query such as: "The total number of shipments is known; obtain the totals for each individual supplier." Of course, data at the more detailed levels must be available (or computable) in order to answer such a query.

12 Note the difference from relational systems. In a realistic relational counterpart of this example there would be no empty quantity "cell" in a row (c, p, t), because the row (c, p, t) would simply be absent. Hence, when the relational model is used there is, unlike with multidimensional arrays, no need to support "sparse arrays" (or rather "sparse tables"), and therefore no sophisticated compression methods are required to work with such tables.

Multidimensional database products also provide a number of statistical and other mathematical functions that help formulate and test hypotheses (i.e. hypotheses about presumed relationships). Visualization and reporting tools are also provided to help with such tasks. Unfortunately, there is as yet no standard query language for multidimensional databases, although research is under way on a calculus on which such a standard could be based. Nor is there anything like the relational theory of normalization that could serve as a scientific basis for designing multidimensional databases.

To conclude this section, note that some products combine both approaches, ROLAP and MOLAP; such hybrid OLAP systems are called HOLAP. There is extensive debate about which of the three approaches is best, so it is worth saying a few words on the question¹³. In general, MOLAP systems offer faster computation but support smaller data volumes than ROLAP systems, i.e. they become less efficient as the amount of data grows. ROLAP systems, in turn, provide greater scalability, concurrency and manageability than MOLAP systems. Moreover, the SQL standard has recently been extended with many statistical and analytical functions (see Section 22.8), so ROLAP products are now also able to provide this extended functionality.

OLAP (Online Analytical Processing) is an information process that allows the user to query the system, perform analysis and so on interactively, online. Results are generated within seconds.

OLAP systems are made for end users, while OLTP systems are made for professional IS users. OLAP covers activities such as generating queries, requesting ad hoc reports, performing statistical analysis, and building multimedia applications.

To provide OLAP, you need to work with a data warehouse (or multidimensional storage) as well as a toolkit, usually with multidimensional capabilities. These tools can be query tools, spreadsheets, data mining tools, data visualization tools, etc.

The concept of OLAP is based on the principle of multidimensional data representation. E. Codd examined the shortcomings of the relational model, pointing out first of all the impossibility of combining, viewing and analyzing data in terms of multiple dimensions, that is, in the way most understandable to corporate analysts, and defined general requirements for OLAP systems that extend the functionality of relational DBMSs and include multidimensional analysis as one of their characteristics.

He formulated 12 rules that an OLAP-class software product must satisfy. These rules are:

1. Multidimensional conceptual representation of data.

2. Transparency.

3. Accessibility.

4. Steady performance.

5. Client-server architecture.

6. Dimension equality.

7. Dynamic processing of sparse matrices.

8. Multi-user support.

9. Unlimited support for cross-dimensional operations.

10. Intuitive data manipulation.

11. Flexible reporting mechanism.

12. Unlimited number of dimensions and aggregation levels.

This set of requirements, which served as the de facto definition of OLAP, should be regarded as recommendations, and specific products should be evaluated by how closely they approach full compliance with all the requirements.


Data mining (Data Mining) and knowledge (Knowledge Mining). Management and analysis of large amounts of data (Big data). Business intelligence systems (Business Intelligence, BI).

Intelligent data analysis (IAD) is a general term for data analysis with the active use of mathematical methods and algorithms (optimization methods, genetic algorithms, pattern recognition, statistical methods, Data Mining, etc.) that make use of the results of applying data visualization methods.

In general, the IAD process consists of three stages:

1) identification of patterns (free search);

2) using the revealed patterns to predict unknown values ​​(forecasting);

3) analysis of exceptions to identify and interpret anomalies in the patterns found.

Sometimes there is an intermediate stage of checking the reliability of the found patterns (validation stage) between their finding and use.

All IAD methods, according to the principle of working with initial data, are divided into two groups:

Case-based reasoning methods: the raw data can be kept in explicit, detailed form and used directly for prediction and/or exception analysis. The drawback of this group of methods is that they are difficult to apply to large volumes of data.

Methods that identify and use formalized patterns: these require information to be extracted from the primary data and converted into formal constructs whose form depends on the specific method.

Data Mining (DM) is a technology for discovering previously unknown non-trivial, practically useful and accessible knowledge in raw data, which is necessary for making decisions in various areas of human activity. The algorithms used in Data Mining require a large amount of calculations, which previously was a limiting factor in the wide practical application of these methods, but the increase in the performance of modern processors has removed the severity of this problem.

The Business Intelligence market consists of 5 sectors:

1. OLAP products;

2. Data mining tools;

3. Tools for building data warehouses and data marts (Data Warehousing);

4. Managerial Information Systems and applications;

5. End user tools for querying and reporting.

Currently, the leaders of corporate BI platforms include MicroStrategy, Business Objects, Cognos, Hyperion Solutions, Microsoft, Oracle, SAP, SAS Institute and others (Appendix B provides a comparative analysis of some of the functionality of BI systems).

Anna Ivanova

Today almost any company, whether large or very small, private or public, uses information systems in its activities, and as a rule has done so for more than a year. This means that most enterprises already possess a certain amount of accumulated data, and this volume is often of considerable value; this is confirmed, if nothing else, by the considerable attention the press has paid in recent years to leaks of corporate data, which are regarded as a profitable commodity on the criminal market.

Note that the value of corporate data lies not only in the combined value of the individual records, but also, and often to a far greater extent, in the value of the data set as a source of additional information that cannot be obtained from one or a few records, such as information about patterns, trends or interdependencies that allows certain business decisions to be made. That is why modern enterprise and supply chain management tools, banking information systems, and other business applications usually include not only data entry and editing tools but also analytical processing tools that help to identify and present patterns and trends in the data. Today these tools are very diverse. They include tools for building relational data warehouses, i.e. specially designed databases that allow data-selection queries to be executed quickly; server and client tools for building multidimensional data warehouses containing aggregate data (sums, averages) in non-relational structures; client applications that provide user interfaces to relational and multidimensional data stores; and tools for creating solutions based on such storages, for analyzing multidimensional and relational data, and for generating reports on them. Below we discuss the products of each of these categories.

Data warehouses

A data warehouse is usually understood as a database whose main purpose is the execution of analytical queries that select data. Data warehouses can be either relational or multidimensional.

Ralph Kimball, one of the originators of the data warehouse concept, described a data warehouse as "a place where people can access their data" (see, for example, Ralph Kimball, The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses, John Wiley & Sons, 1996 and The Data Webhouse Toolkit: Building the Web-Enabled Data Warehouse, John Wiley & Sons, 2000). He also formulated the basic requirements for data warehouses:

  • support for high-speed data retrieval from storage;
  • support for internal data consistency;
  • the ability to obtain and compare the so-called data slices (slice and dice);
  • availability of convenient utilities for viewing data in the storage;
  • completeness and reliability of stored data;
  • support for a quality data replenishment process.

It is often impossible to satisfy all of the listed requirements within a single product. Therefore, several products are usually used to implement a data warehouse: some of them are actually the means of storing the data, others are the means of extracting and viewing it, still others are the means of replenishing it, and so on.

Note that when designing warehouses, a priori assumptions are always made about the nature of the interdependencies of the data placed in them, and the benefits of using a data warehouse in making management decisions largely depend on the correctness of these assumptions.

Relational data stores

Unlike so-called online databases, which are used by applications that modify data, relational data stores are designed in such a way as to achieve the minimum execution time for read requests (online databases most often minimize the execution time for data modification requests). Typically, data is copied to storage from online databases according to a specific schedule.

The typical structure of a data warehouse is significantly different from the structure of a conventional relational DBMS. Typically, this structure is denormalized (which improves query performance) and can allow data redundancy. A typical data warehouse structure is shown in fig. 1. The main components of this structure are the fact table and the dimension tables.

The fact table (in the example in Fig. 1 it is called Sales_Fact) is the main table of the data warehouse. As a rule, it contains information about the objects or events whose totality is to be analyzed. It usually has a unique composite key combining the primary keys of the dimension tables; most often these are integer or date/time values, since a fact table can contain hundreds of thousands or even millions of records, and storing repeated text descriptions in it is usually wasteful. In addition, the fact table contains one or more numeric fields from which aggregate data is computed when analytical queries are executed.

Note that there is no information in the fact table about how to group records when calculating aggregate data. This information is contained in dimension tables.

Dimension tables contain immutable or rarely changed data. They have at least one descriptive field and usually an integer key field (usually a surrogate key). Often (but not always), a dimension table may also contain fields that point to additional attributes that were in the original operational database, or to attributes responsible for grouping its own data. Each dimension table must be in a one-to-many relationship with the fact table.

Note that the dimension tables should grow much more slowly than the fact table; for example, a new record is added to the product dimension table only when a new, previously unsold product appears.
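
A minimal star-schema sketch in Python with SQLite; apart from the Sales_Fact name borrowed from Fig. 1, all table names, column names and data are illustrative assumptions, not the structure used in the article.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Dim_Product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE Dim_Date    (date_id    INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE Sales_Fact (
    product_id INTEGER REFERENCES Dim_Product(product_id),
    date_id    INTEGER REFERENCES Dim_Date(date_id),
    amount     REAL,
    PRIMARY KEY (product_id, date_id)       -- composite key of dimension keys
);
INSERT INTO Dim_Product VALUES (1, 'chips'), (2, 'cola');
INSERT INTO Dim_Date    VALUES (1, 2004, 1), (2, 2004, 2);
INSERT INTO Sales_Fact  VALUES (1, 1, 100.0), (2, 1, 80.0), (1, 2, 120.0);
""")

# An analytical query joins the central fact table with the dimension tables.
for row in con.execute("""
    SELECT d.year, d.month, p.product_name, SUM(f.amount)
    FROM Sales_Fact f
    JOIN Dim_Product p ON p.product_id = f.product_id
    JOIN Dim_Date d    ON d.date_id    = f.date_id
    GROUP BY d.year, d.month, p.product_name
"""):
    print(row)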

Modern data design tools, such as the CA AllFusion Modeling Suite, typically include templates for designing data warehouses. It should be said that specialized DBMS are sometimes used to create relational data warehouses, data storage in which is optimized in terms of query execution speed. An example of such a product is Sybase Adaptive Server IQ, which implements an unconventional way to store data in tables. However, you can create storage in conventional relational DBMS.

OLAP and multidimensional data warehouses

Multidimensional data warehouses form the basis of OLAP tools (On-Line Analytical Processing) designed for complex multidimensional data analysis. The concept of OLAP was described in 1993 by E. F. Codd, the author of the relational data model, and OLAP support is currently implemented in many DBMS and data analysis tools.

Multidimensional stores typically contain aggregated data (for example, sums, averages, counts) for various samples. Most often these aggregate functions form a multidimensional data set called a cube, whose axes (called dimensions) contain the parameters and whose cells contain the aggregate data that depends on them (sometimes called measures). Along each axis the data can be organized into hierarchies reflecting different levels of detail. As a rule, the aggregate data is obtained by executing a series of grouping (GROUP BY) queries over the source data.

Note that relational data warehouses often act as a data source for such queries. In this case, the dimension tables typically contain the source data for generating the cube dimensions, and the fact table contains the source data for calculating the cube measures.

Multidimensional data warehouses contain aggregate data of varying degrees of detail, such as sales by day, month, year, or product category. The purpose of storing aggregate data is to reduce query execution time, since in most cases analysis and forecasting require not detailed but summary data. However, saving all aggregate data is not always justified: when new dimensions are added, the amount of data making up the cube grows exponentially (this is sometimes called the "explosive growth" of data volume). To cope with this problem, various schemes are used that compute only a fraction of all possible aggregates yet still achieve acceptable query execution speed.

Both source and aggregate data can be stored in either relational or multidimensional structures. Therefore, there are currently three ways to store data:

  • MOLAP (Multidimensional OLAP) - source and aggregate data are stored in a multidimensional database;
  • ROLAP (Relational OLAP) - the source data remains in the same relational database where they were originally located, while the aggregate data is placed in service tables specially created for their storage in the same database;
  • HOLAP (Hybrid OLAP) - the source data remains in the same relational database where it was originally located, while the aggregate data is stored in a multidimensional database.

Some OLAP tools support data storage only in relational structures, others only in multidimensional structures. However, most modern OLAP server tools support all three data storage methods. The choice of storage method depends on the volume and structure of the source data, the requirements for the speed of query execution, and the frequency of updating OLAP cubes.

The DBMS released in recent years by the leading manufacturers (IBM, Microsoft, Oracle) contain tools for creating multidimensional data warehouses (this tradition was started several years ago by Microsoft, which included an OLAP server in SQL Server 7.0). There are also separate products for creating OLAP storages; they are released by Hyperion, Sybase, Business Objects and several others.

Data Mining

The term Data Mining (literally, the "mining" of data) refers to the process of searching for correlations, trends, relationships and patterns in data using various mathematical and statistical algorithms: clustering, subsampling, regression and correlation analysis. Examples of the information sought are: which categories of buyers most often purchase a particular product; what share of the buyers of one specific product also purchase another specific product; which category of customers most often fails to repay a loan on time. Such information is usually used in forecasting, strategic planning and risk analysis, and its value for the enterprise is very high.

Note that traditional mathematical statistics and OLAP tools are not always suitable for solving such problems. Statistical methods and OLAP are usually used to test pre-formulated hypotheses, but it is often the formulation of a hypothesis that turns out to be the most difficult task in business analysis for subsequent decision making, since not all patterns in the data are obvious at first glance.

The basis of modern Data Mining technology is the concept of patterns reflecting the regularities inherent in subsamples of the data. The search for patterns is performed by methods that do not rely on any a priori assumptions about these subsamples. While statistical analysis or an OLAP application usually formulates questions like "What is the average number of bank customers who did not repay their loan on time among unmarried men from 40 to 50 years old?", the use of Data Mining, as a rule, implies answers to questions like "Is there a typical category of clients who do not repay loans on time?". And it is the answer to the second question that often ensures a successful business decision.

An important feature of Data Mining is that the patterns being sought are non-trivial and non-obvious. In other words, Data Mining tools differ from statistical data processing tools and OLAP tools in that, instead of checking interdependencies assumed in advance by the users, they are able to find such interdependencies on their own from the available data and to build hypotheses about their nature. However, the use of Data Mining tools does not exclude the use of statistical tools and OLAP tools, since the results of processing data with the latter, as a rule, contribute to a better understanding of the nature of the patterns that should be sought. There are, accordingly, Data Mining tools that can search for patterns, correlations and trends in both relational and multidimensional data stores.

Usually, there are five standard types of patterns identified by Data Mining methods:

  • association - a high probability that events are linked to one another (for example, alpine skis are often purchased together with ski boots); a minimal sketch of detecting such an association is given after this list;
  • sequence - a high probability of a chain of events related in time (for example, within a certain period after the purchase of a printer, consumables for it will be purchased with a high probability);
  • classification - there are signs that characterize the group to which this or that event or object belongs (usually, certain rules are formulated based on the analysis of already classified events);
  • clustering - a pattern similar to classification and differing from it in that the groups themselves are not set at the same time - they are detected automatically during data processing;
  • temporal patterns - the presence of regularities in the dynamics of the behavior of certain data (a typical example is seasonal fluctuations in demand for certain goods or services), used for forecasting.
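A minimal sketch of the first pattern type (plain Python; the transactions are invented for the example): simply counting how often two goods occur in the same purchase already suggests associations of the "alpine skis, hence ski boots" kind.

```python
from collections import Counter
from itertools import combinations

# Each transaction is the set of goods bought together (invented data).
transactions = [
    {"alpine skis", "ski boots", "gloves"},
    {"alpine skis", "ski boots"},
    {"gloves", "scarf"},
    {"alpine skis", "ski boots", "scarf"},
]

pair_counts = Counter()
item_counts = Counter()
for t in transactions:
    item_counts.update(t)
    pair_counts.update(frozenset(p) for p in combinations(sorted(t), 2))

# Confidence of the rule "alpine skis -> ski boots": P(boots | skis).
pair = frozenset({"alpine skis", "ski boots"})
confidence = pair_counts[pair] / item_counts["alpine skis"]
print(f"support={pair_counts[pair]}, confidence={confidence:.2f}")  # 3 and 1.00
```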

Today, there are a fairly large number of various data mining methods, among which the following can be distinguished.

Regression, variance and correlation analysis - implemented in most modern statistical packages, in particular in the products of SAS Institute, StatSoft, etc.

Methods of analysis in a specific subject area based on empirical models. Often used, for example, in inexpensive financial analysis tools.

Neural network algorithms, whose idea is based on an analogy with the functioning of nervous tissue: the initial parameters are treated as signals that are transformed according to the existing connections between "neurons", and the response of the entire network to the initial data is taken as the result of the analysis. The connections are created by so-called training of the network on a large sample containing both the original data and the correct answers.

Nearest neighbor method - selecting a close analogue of the source data from the data already accumulated.

Decision trees - a hierarchical structure based on a set of questions implying a "yes" or "no" answer; although this method does not always find existing patterns perfectly, it is used quite often in forecasting systems because of the clarity of the resulting answer (Fig. 3).

Bounded enumeration algorithms - calculate the frequencies of combinations of simple logical events in subgroups of the data.

Evolutionary programming - search for and generation of an algorithm expressing the interdependence of the data, starting from an initially specified algorithm that is modified during the search; interdependencies are sometimes sought among particular classes of functions (for example, polynomials).

Data mining tools traditionally belong to expensive software tools - the price of some of them reaches several tens of thousands of dollars. Therefore, until recently, the main consumers of this technology were banks, financial and insurance companies, large trading enterprises, and the main tasks requiring the use of Data Mining were considered to be the assessment of credit and insurance risks and the development of a marketing policy, tariff plans and other principles of working with clients. In recent years, the situation has changed somewhat: relatively inexpensive Data Mining tools from several manufacturers (including Microsoft) have appeared on the software market, which made this technology available to small and medium-sized businesses that had not thought about it before.

Visualization tools for OLAP data and Data Mining results

Universal visualization tools for OLAP data are released by many companies, such as Business Objects, Cognos, Panorama and ProClarity. As a rule, these tools are intended for users who have some knowledge of databases and statistical methods of analysis. They typically allow access to data warehouses and OLAP sources from various vendors (for example, multidimensional storages based on Oracle, Microsoft and IBM DBMS), retrieval of slices of multidimensional data and construction of diagrams based on them. Vendors of these tools often also supply middleware servers that perform data analysis and deliver results for display in client applications, as well as tools for building solutions on top of the client tools and middleware servers (for example, class libraries or ActiveX controls). Given that the situation with business intelligence standards is still far from ideal (unlike relational DBMS, multidimensional DBMS have neither a generally accepted query language similar to SQL nor universal data access mechanisms similar to ODBC or OLE DB), such tools can, to one degree or another, solve the problem of creating analytical applications in companies that use DBMS and OLAP tools from several different manufacturers.

Manufacturers of OLAP tools, such as Oracle and IBM, often release client applications designed to give users access to OLAP repositories created with their own server tools. The Oracle Corporation, for example, has several such products, combined in the Oracle Business Intelligence package. In addition, spreadsheet add-ins for visualizing OLAP data have recently become widespread: tools for displaying data from the analytical services of Microsoft SQL Server are available to users of Microsoft Excel 2000 and later versions, while Oracle and Hyperion release add-in modules for the same Excel that access their own OLAP storages.

It is worth noting the expansion in recent years of the range of analytical products focused on serving certain industries (for example, retail or wholesale trade, financial services). They are produced by the companies listed above, and a number of other manufacturers, in particular, suppliers of enterprise management systems and other industry-specific business applications.

Report generation tools

A report is a document whose content is dynamically generated from information contained in a database. There are many reporting tools on the software market: standalone products, components of application development tools or DBMS, server services and client applications. As a rule, reporting tools support a wide range of universal data access mechanisms (ODBC, OLE DB, ADO.NET) and often tools for direct access to the most popular DBMS through their client APIs; they contain business graphics tools, integrate with office applications, allow reports to be published on the Internet, and include classes or components for building applications that (among other features) generate reports.

The undisputed market leader in reporting tools is Crystal Reports, owned by Business Objects. It is available both separately and as part of products from other manufacturers, from application development tools to geographic information systems. There is also a server version of this product designed to provide reports to a large number of users. In addition to Crystal Reports, there are several less popular products of this class.

Conclusion

In this review we have covered the main technologies that underlie modern analytical applications. As can be seen, the choice of both technologies and products implementing them is quite wide, especially considering that such tools are included in modern server DBMS and in a wide range of client tools for visualizing analysis results and creating analytical reports. However, the lack of generally accepted standards in the field of business intelligence still creates certain problems when building solutions that rely on it.

Theme 6

CORPORATE INFORMATION SYSTEMS FOR PROCESSING ECONOMIC INFORMATION

The concept of corporate information technology

Essence and significance of corporate information technologies

Among the variety of programs for business, the term "information technology in corporate governance" is traditionally understood as "complex management automation systems." Their other names are also known - enterprise-wide systems, corporate information systems (CIS), corporate (or complex) control systems (CMS), automated control systems (ACS).

As a rule, complex control automation systems are "basic" universal solutions suitable for various types of enterprises. First of all, this concerns financial management, inventory management, and purchasing and sales management. But these same systems often also have industry-specific solutions that reflect a particular specificity and contain the appropriate regulatory and reference base.

For example, the SAP R/3 system solution for the aviation industry supports the accounting and control of serial factory numbers of all aircraft parts, their service life, scheduled replacement or repair, which ensures not only production reliability, but also passenger safety.

Since complex control systems are focused primarily on large enterprises with multi-profile structures, they not only offer a developed set of functions but also provide reliable storage and processing of large amounts of information, using powerful platforms and system tools for multi-user work.

Modern information technologies, communications and the Internet allow solving the problems of remote access to a single database, which is also relevant for corporate governance.

Building concept

Although most developers call their software products management systems (of the enterprise, warehouse, finance, etc.), in essence almost all software used in corporate governance consists of accounting systems that record the facts and documents of financial and economic activity and can build reports and summaries in the sections allowed by the analytical attributes. That is, structured information is entered into the database; the structure is defined, to a certain extent, by interconnected reference books, classifiers, parameters and standard document forms. From the information available in the database, the tools "build", "draw" and "assemble" the so-called "slices". Having received such reports and summaries, often called analytical, management can make decisions. This is the typical concept and the typical technology for working with systems of the class under consideration.



It is no coincidence that such “management” software, which is different in terms of functionality, system solutions, purpose and use, such as “Galaktika”, “BEST” and “1C: Enterprise”, is similar in terms of the principles of organizing information, the technology of its formation and processing, as well as methods of interaction with systems.

Nevertheless, enterprises, for example OAO Uralelektromed, put forward such stringent and diverse requirements for corporate governance tools that it becomes necessary to build them on a multi-level basis. Usually the core of the system is its central part, containing only program code. The next conceptually important element is the built-in toolkit of the system, which allows, without changing the program code, at least configuring it at the workplace, performing specific operations, entering new and changing existing forms of primary and reporting documents, and using other parametric settings. More developed systems have built-in tools for creating various models of the enterprise: informational, organizational, functional, etc. And, finally, there is the database itself.

Analytical processing of information

Planning the activities of an enterprise, obtaining operational information and making the right decisions on the basis of its analysis involve processing large amounts of data. Reports generated in accounting-oriented corporate management systems are usually not flexible: they cannot be rotated, expanded or collapsed to obtain the desired representation of the data, including a graphical one. The more "slices" and "sections" of the data that can be built, the more realistic the picture of the enterprise's activities and the better the decisions on managing business processes. Such tasks require mathematical and economic modeling as well as high processing speed. An analytical module is available in the RepCo system; the Triumph-Analytics system (PARUS Corporation - Tora Center) is better known. Accounting systems simply build summaries in various "sections" from the information stored in the database and present what already exists, whereas analytical systems build new information according to given parameters or criteria and optimize it for specific purposes. Therefore, a special tool for viewing and visualizing information is often needed, namely online analytical processing (OLAP), which provides a set of convenient and fast tools for accessing, viewing and multidimensionally analyzing the information accumulated in the warehouse.

OLAP technologies are used to model situations according to the "what will happen if..." scheme and to compile various analytical reports; there are also specialized Western software products for this purpose.

Usually, information is transferred from corporate management systems into specialized programs for analytical data processing. Many domestic developers try to solve these problems on their own, for example the companies Nikos-Soft (the NS-2000 system), Cepheus (the Etalon corporate management system), COMSOFT (the software, methodological and tool complex COMSOFT-STANDARD 2.0) and others.

6.4. Prospects for the development and use of corporate information technologies

In addition to the development and use of modern tools and platforms, as well as system tools, the development of domestic corporate systems implies their functional saturation, especially in terms of production.

Despite the widespread enthusiasm for the implementation of management standards, the leading players in the domestic software market are developing industry solutions for various types of industries.

Firms' fears of revealing the "confidentiality" of their developments are decreasing, which encourages them to join efforts to integrate their products rather than to develop everything from "A" to "Z" on their own. Today no one has enough resources for that: it takes years to grasp a new concept and to develop a project and a system - a system whose quality changes depending on what is put into it. In addition, the requirement to integrate software products is also put forward by enterprises that want to keep their existing, as a rule specialized, systems in operation and to combine them informationally with newly acquired ones.

Integration is also required between products from different manufacturers, in order to combine complex solutions with specialized ones: budgeting, financial and economic analysis, customer service, analytical data processing, etc.

It should be noted that it is not the control systems themselves that are more promising, but a simple and universal tool for their creation, intended for qualified intermediaries between the developer and the end user. Now system administrators and analysts are trying to perform these functions.

If such a tool is available, "ready-made" standard solutions for all enterprises of all industries will be in demand.

The Internet as an additional tool for business development can be effectively used only if there is an integrated management system.

Although modern information and communication technologies, including the Internet, make it possible to organize the rental of software, it is premature to talk about the near-term use of such opportunities, especially in our country, and not so much for reasons of confidentiality as because of the lack of order and of reliable means of communication.

Attempts to introduce information technologies at domestic enterprises, and the experience of using them even partially, have proven in practice that "chaos cannot be automated". A preliminary reorganization of the business and of the enterprise itself is necessary, as well as the construction of management regulations (instructions). It is difficult for the employees of an enterprise to cope with such work on their own, especially given the time factor in market conditions. Therefore, the practice of working with consulting companies is developing everywhere: they help enterprises and teach their employees to eliminate bottlenecks, set up the main business processes, develop technology, build information flows, and so on. Automating an established process is easier, simpler, cheaper and faster.

Everyone must do their own job. An accountant, a storekeeper, a sales manager and other "subject-matter" specialists should not have to improve document forms or move columns around because of changes in legislation or business schemes. Therefore, the software market is gradually turning from a "product" market into a "service" market. Outsourcing is beginning to develop: the transfer of certain functions of the enterprise to the specialists of external companies, who take over the maintenance of equipment and system software, the modification of the applied (functional) part of the systems, and so on.

The most important and relevant in the use of corporate management systems is the information technology and methodological service for their users and consumers.

(DBMS. - 1998. - No. 4-5)

The modern level of development of hardware and software has made it possible for some time now to maintain databases of operational information at all levels of management. In the course of their activities, industrial enterprises, corporations, departmental structures, public authorities and administrations have accumulated large amounts of data. They contain great potential for extracting useful analytical information, on the basis of which you can identify hidden trends, build a development strategy, and find new solutions.

In recent years, a number of new concepts for storing and analyzing corporate data have taken shape in the world:

  • data warehouses;
  • online analytical processing (OLAP);
  • data mining (intelligent data analysis, IAD).

This article is devoted to an overview of these concepts, as well as to the proof of their complementarity in the matter of supporting managerial decision-making.

1. Storage (warehouses) of data

In the field of information technology, two classes of systems have always coexisted [, p. 49]:

  • systems oriented towards operational (transactional) data processing (SOD);
  • systems oriented towards data analysis - decision support systems (DSS).

At the first stages of informatization it is always necessary to put in order the processes of everyday routine data processing, which is exactly what traditional SOD are aimed at, so the earlier development of this class of systems is quite understandable.

Systems of the second class, DSS, are secondary to them. A situation often arises in which data in an organization accumulates in a number of unrelated SOD that largely duplicate each other but are not coordinated in any way. In this case it is practically impossible to obtain reliable comprehensive information, despite its apparent abundance.

The purpose of building a corporate data warehouse is to integrate, update and harmonize operational data from heterogeneous sources in order to form a single consistent view of the object of management as a whole. The concept of data warehouses is based on recognizing the need to separate the data sets used for transactional processing from the data sets used in decision support systems. Such a separation is possible by integrating the detailed data scattered across SOD and external sources into a single repository, coordinating and, possibly, aggregating them. W. Inmon, the author of the data warehouse concept, defines such warehouses as:

  • "subject-oriented,
  • integrated,
  • immutable,
  • supporting chronology

data sets organized to support management", designed to act as a "single and only source of truth" that provides managers and analysts with the reliable information necessary for operational analysis and decision support.

The concept of data warehouses involves not just a single logical view of the organization's data but the actual implementation of a single integrated data source. An alternative to this concept for forming a unified view of corporate data is the creation of a virtual source based on the distributed databases of the various SOD. In this case each query to such a source is dynamically translated into queries to the source databases, and the results are coordinated, linked, aggregated and returned to the user "on the fly". However, for all its outward elegance, this approach has a number of significant drawbacks.

  1. The processing time of queries to a distributed warehouse is significantly longer than for a centralized one. In addition, SOD database structures, designed for the intensive updating of individual records, are highly normalized, so an analytical query against them requires joining a large number of tables, which also degrades performance.
  2. An integrated view of a distributed enterprise warehouse is possible only if all data sources are constantly reachable over the network. Thus, the temporary unavailability of even one of the sources can either make the information and analytical system (IAS) inoperable or lead to erroneous results.
  3. Execution of complex analytical queries against SOD tables consumes a large amount of database server resources and reduces SOD performance, which is unacceptable, since the execution time of operations in SOD is often highly critical.
  4. Different SOD may support different data formats and encodings, and the data in them may be inconsistent. Very often several different answers can be obtained to the same question, owing to non-synchronized data update moments, differences in the interpretation of individual events, concepts and data, changes in the semantics of data as the subject area develops, input errors, the loss of archive fragments, and so on. In this case the goal - the formation of a single consistent view of the object of management - may not be achieved.
  5. The main drawback is the practical impossibility of reviewing long historical sequences: in the absence of a central repository, only those data are available that are present in the real databases of the connected SOD at the time of the query. The main purpose of SOD is online data processing, so they cannot afford the luxury of storing data for a long period (more than a few months); as data become obsolete, they are unloaded to the archive and deleted from the transactional database. Analytical processing, by contrast, is interested precisely in the view of the object of management in historical retrospective.

Thus, the data warehouse operates according to the following scenario. According to a given regulation, it collects data from various sources - databases of online processing systems. The storage supports chronology: along with the current data, historical data is stored with an indication of the time to which they refer. As a result, the necessary available data about the control object is collected in one place, brought to a single format, agreed and, in some cases, aggregated to the minimum required level of generalization.

A lightweight version of the corporate data warehouse can be data marts (Data Mart), that is, thematic databases containing information related to certain aspects of the organization's activities. The concept of data marts was proposed by Forrester Research in 1991. At the same time, the main idea was that data marts contain thematic subsets of pre-aggregated data, which are much smaller in size than a corporate data warehouse, and, therefore, require less productive equipment to maintain. In 1994, M. Demarest proposed to combine the two concepts and use the data warehouse as a single integrated source for multiple data marts. In this version, the corporate information and analytical system has a three-level structure:

  • corporate-wide centralized data storage;
  • thematic data marts at the departmental level;
  • end-user workstations equipped with analytical tools.

The considered concept is focused exclusively on storage, and not on the processing of corporate data. It does not predetermine the architecture of target analytical systems, but only creates a field for their functioning, concentrating on data requirements. Thus, it leaves freedom of choice in everything related to:

  • ways of presenting data in the target storage (for example, relational, multidimensional);
  • storage data analysis modes.

2. Methods of analytical data processing

In order for the existing data warehouses to contribute to the adoption of management decisions, the information must be presented to the analyst in the right form, that is, he must have developed tools for accessing and processing data from the warehouse.

According to the criterion of data analysis mode, information and analytical systems are divided into two categories [ , ]:

  • static (including a predefined set of data processing and reporting scenarios); this category includes the so-called executive information systems (ISS);
  • dynamic (supporting the construction and execution of ad hoc queries and the generation of reports of arbitrary form).

Very often, IAS created with the expectation of direct use by decision makers turn out to be extremely easy to use but severely limited in functionality. Such static DSS [ , p. 55], or Executive Information Systems (ISS) [ , p. 73] (Executive Information Systems, EIS) [ , p. 4], contain predefined sets of queries and, while sufficient for everyday review, are unable to answer all the questions about the available data that may arise when making decisions. (According to the definition of V. Przyjalkowski [ , p. 81], an ISS is "a computer system that allows ... to provide information to senior management personnel with limited computer experience".) The result of the work of such a system, as a rule, is a set of multi-page reports, after careful study of which the analyst has a new series of questions; however, each new query not foreseen when the system was designed must first be formally described, passed to the programmer, coded and only then executed. The waiting time in this case can amount to hours and days, which is not always acceptable. Thus, the external simplicity of static DSS, for which most customers of information and analytical systems actively fight, turns into a catastrophic loss of flexibility.

Dynamic DSS, on the other hand, are oriented towards processing unregulated, unexpected (ad hoc) analyst queries against the data. The requirements for such systems were considered most thoroughly by E. F. Codd in the article that laid the foundation of the OLAP concept. Analysts work with these systems in an interactive sequence of forming queries and studying their results, each of which can generate the need for a new series of queries.

But dynamic DSS can operate not only in the field of online analytical processing (OLAP); support for managerial decision-making based on accumulated data can be provided in three basic areas: the area of detailed data, the area of aggregated indicators, and the area of patterns.

According to Codd, a multidimensional conceptual view is the most natural view of management personnel on the object of management. It is a multiple perspective consisting of several independent dimensions along which particular sets of data can be analyzed. Simultaneous analysis along several dimensions is defined as multidimensional analysis. Each dimension includes directions of data consolidation consisting of a series of successive levels of generalization, where each higher level corresponds to a greater degree of aggregation of the data along the corresponding dimension. Thus, the dimension Contractor can be determined by the consolidation direction consisting of the generalization levels "enterprise - subdivision - department - employee". The dimension Time can even include two consolidation directions, "year - quarter - month - day" and "week - day", since counting time by months and by weeks is not compatible. In this case it becomes possible to select an arbitrary level of detail for each of the dimensions. The operation of descending (drilling down) corresponds to moving from higher levels of consolidation to lower ones; conversely, the operation of ascending (rolling up) means moving from lower levels to higher ones (Fig. 2; a minimal sketch of these two operations is given after the figure).


Fig. 2. Dimensions and directions of data consolidation.
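A minimal sketch of these operations (Python with the pandas library; the data and the Region - City hierarchy are invented for the example): rolling up aggregates the data to a higher level of the hierarchy, drilling down returns to a more detailed one.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["Center", "Center", "Center", "Urals", "Urals"],
    "city":   ["Moscow", "Moscow", "Tula",   "Perm",  "Perm"],
    "month":  ["2024-01", "2024-02", "2024-01", "2024-01", "2024-02"],
    "amount": [900.0, 400.0, 150.0, 310.0, 220.0],
})

# Detailed level of the territorial hierarchy: Region -> City.
by_city = sales.groupby(["region", "city"])["amount"].sum()
print(by_city)

# Roll up: move to the higher consolidation level (Region only).
by_region = sales.groupby("region")["amount"].sum()
print(by_region)

# Drill down along Time inside one region: Region -> Month.
by_month = sales[sales["region"] == "Center"].groupby("month")["amount"].sum()
print(by_month)
```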

3.1. Requirements for online analytical processing tools

Codd identified 12 rules that an OLAP class software product must satisfy (Table 1).

Table 1. Rules for evaluating OLAP software products.

1. Multi-Dimensional Conceptual View. The conceptual representation of the data model in an OLAP product must be multidimensional in nature, that is, it must allow analysts to perform intuitive operations of slicing and dicing (slice and dice), rotation (rotate) and pivoting (pivot) of the consolidation directions.
2. Transparency. The user does not need to know what specific means are used to store and process the data, how the data is organized and where it comes from.
3. Accessibility. The analyst must be able to perform analysis within a common conceptual schema, while the data may remain under the control of legacy DBMS, being tied to the common analytical model. That is, the OLAP toolkit must impose its logical schema on the physical data sets, performing all the transformations required to provide a single, consistent and holistic user view of the information.
4. Consistent Reporting Performance. As the number of dimensions and the size of the database grow, analysts should not experience any performance degradation. Sustained performance is essential for maintaining the ease of use and the freedom from complexity required to bring OLAP to the end user.
5. Client-Server Architecture. Most of the data that requires online analytical processing is stored in mainframe systems and accessed from personal computers. Therefore, one of the requirements is the ability of OLAP products to work in a client-server environment. The main idea here is that the server component of an OLAP tool must be intelligent enough to build a common conceptual schema by generalizing and consolidating the various logical and physical schemas of the enterprise databases in order to provide the effect of transparency.
6. Generic Dimensionality (equality of dimensions). All data dimensions must be equal. Additional features can be granted to individual dimensions, but since all of them are symmetrical, the same additional functionality can be granted to any other dimension. The underlying data structures, formulas and reporting formats must not be tied to any one dimension.
7. Dynamic Sparse Matrix Handling. An OLAP tool must handle sparse matrices optimally. The access speed must not depend on the location of the data cells and must remain constant for models with different numbers of dimensions and different degrees of data sparseness.
8. Multi-User Support. Often several analysts need to work simultaneously with the same analytical model or to create different models based on the same corporate data. An OLAP tool must provide them with concurrent access and ensure the integrity and protection of the data.
9. Unrestricted Cross-dimensional Operations. Computation and manipulation of data across any number of dimensions must not prohibit or restrict any relationships between data cells. Transformations that require an arbitrary definition must be specifiable in a functionally complete formula language.
10. Intuitive Data Manipulation. Reorientation of consolidation directions, detailing of data in columns and rows, aggregation and other manipulations inherent in the structure of the hierarchy of consolidation directions should be performed in the most convenient, natural and comfortable user interface.
11. Flexible Reporting. Various ways of visualizing data must be supported, that is, reports must be presentable in any possible orientation.
12. Unlimited Dimensions and Aggregation Levels. It is strongly recommended that every serious OLAP tool allow at least fifteen, and preferably twenty, dimensions in the analytical model. Moreover, each of these dimensions must allow a practically unlimited number of user-defined aggregation levels along any direction of consolidation.

This set of requirements, which has served as a de facto definition of OLAP, has been criticized quite often. Thus, it has been noted that the 12 requirements mix:

  • actual requirements for functionality (1, 2, 3, 6, 9, 12);
  • informal wishes (4, 7, 10, 11);
  • requirements for the architecture of an information system that have a very approximate relationship to functionality (5, 8); for example, according to requirement 5, a system implemented on the basis of a UNIX server with terminals cannot be an OLAP product, since it does not work in a client-server architecture; also, an OLAP product cannot be a desktop single-user system, since requirement 8 is violated in this case.

On the other hand, according to Codd himself, none of the operational data analysis products currently on the market fully satisfies all the requirements put forward by him. Therefore, 12 rules should be considered as recommendations, and specific products should be evaluated by the degree of approximation to ideally full compliance with all requirements.

3.2. Classification of OLAP products according to the way data is presented

There are currently about 30 products on the market that provide some degree of OLAP functionality (according to the Web review server http://www.olapreport.com as of February 1998). Providing a multidimensional conceptual view from the user interface to the source database, all OLAP products are divided into three classes according to the type of the source database.

In addition to the listed tools, there is another class - desktop query and report generation tools, supplemented with OLAP functions and / or integrated with external tools that perform such functions. These rather advanced systems fetch data from the original sources, transform it and place it in a dynamic multidimensional database that operates on the end user's client station. For working with small, simply organized databases, these tools are best suited. The main representatives of this class are BusinessObjects of the company of the same name, BrioQuery by Brio Technology [ , p. 34] and PowerPlay by Cognos [ , pp. 34-35].

3.2.1. Multidimensional OLAP (MOLAP)

In specialized DBMS based on a multidimensional data representation, the data is organized not in the form of relational tables but in the form of ordered multidimensional arrays (a minimal storage sketch is given after the list below):

  • hypercubes (all cells stored in the database must have the same dimension, that is, be in the most complete basis of measurements) or
  • polycubes (each variable is stored with its own set of measurements, and all the processing difficulties associated with this are shifted to the internal mechanisms of the system).
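The practical difference between a dense hypercube and storing only the cells that actually exist can be shown with a minimal sketch (Python with NumPy; the dimension sizes and the single sale are invented). Real MOLAP servers use far more sophisticated structures; the sketch only illustrates why sparseness matters.

```python
import numpy as np

# Dense "hypercube": one cell for every combination of dimension members.
n_products, n_stores, n_days = 100, 50, 365
cube = np.zeros((n_products, n_stores, n_days), dtype=np.float32)
cube[5, 17, 42] = 900.0          # a single real sale
print(cube.size)                  # 1_825_000 cells are allocated regardless

# Sparse alternative: keep only the existing cells, as a fact table does.
sparse = {(5, 17, 42): 900.0}
print(len(sparse))                # 1 cell actually stored
```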

The use of multidimensional databases in online analytical processing systems has the following advantages.

On the other hand, there are significant limitations.

Therefore, the use of multidimensional DBMS is justified only under the following conditions.

  1. The amount of initial data for analysis is not too large (no more than a few gigabytes), that is, the level of data aggregation is quite high.
  2. The set of information dimensions is stable (because any change in their structure almost always requires a complete restructuring of the hypercube).
  3. The system response time to ad hoc requests is the most critical parameter.
  4. It requires extensive use of complex built-in functions to perform cross-dimensional calculations on hypercube cells, including the ability to write custom functions.
3.2.2. Relational OLAP (ROLAP)

The direct use of relational databases as initial data in online analytical processing systems has the following advantages.

  1. For online analytical processing of the contents of data warehouses, ROLAP tools allow you to perform analysis directly on the warehouse (because in the vast majority of cases, corporate data warehouses are implemented using relational DBMS).
  2. In the case of a variable dimension of the problem, when changes to the dimension structure have to be made quite often, ROLAP systems with a dynamic dimension representation are the optimal solution, since such modifications in them do not require physical reorganization of the database.
  3. ROLAP systems can operate on much less powerful client stations than MOLAP systems, since the main computational load in them falls on the server, where complex analytical SQL queries generated by the system are executed.
  4. Relational DBMS provide a much higher level of data protection and access rights.
  5. Relational DBMSs have real experience with very large databases and advanced administration tools.

The disadvantages of ROLAP systems have already been mentioned when listing the advantages of using multidimensional databases. These are, firstly, limited possibilities in terms of calculating values ​​of a functional type, and secondly, lower performance. To ensure performance comparable to MOLAP, relational systems require careful study of the database schema and special tuning of indexes. But as a result of these operations, the performance of well-tuned relational systems using the star schema is quite comparable to the performance of systems based on multidimensional databases.

The works [ , , ] are entirely devoted to the description of the star schema and recommendations for its use. Its idea is that there is a table for each dimension, while all the facts are placed in one table indexed by a multiple key composed of the keys of the individual dimensions. Each ray of the star schema defines, in Codd's terminology, a direction of data consolidation along the corresponding dimension (for example, Store - City/District - Region).

In the general case, facts have different sets of dimensions, and it is then convenient to store them not in one but in several tables; besides, individual user queries may be interested in only a part of the possible dimensions. But with this approach, given a large number of independent dimensions, it is necessary to maintain many fact tables corresponding to each possible combination of the dimensions selected in a query, which leads to wasteful use of external memory, longer loading of data into the star schema database from external sources, and administration difficulties. To solve this problem, the authors of [ ] propose a special extension of the SQL language (the "GROUP BY CUBE" operator and the "ALL" keyword) (NOTE: this extension has not yet been accepted, so the proposal is of purely academic interest), while the authors of [ , ] recommend creating fact tables not for all possible combinations of dimensions, but only for the most complete ones (those whose cell values cannot be obtained by subsequent aggregation of the cells of other fact tables in the database).

In complex tasks with multilevel measurements, it makes sense to refer to the extensions of the star schema - the constellation schema (fact constellation schema) [ , pp. 10-11] and the snowflake schema (snowflake schema) [ , pp. 13-15]. In these cases, separate fact tables are created for possible combinations of summarization levels of different dimensions. This provides the best performance, but often results in data redundancy.

In any case, if the multidimensional model is implemented as a relational database, you should create long and "narrow" fact tables and relatively small and "wide" dimension tables. The fact tables contain the numerical values ​​of the cells of the hypercube, and the remaining tables define the multidimensional basis of dimensions containing them.

The orientation towards representing multidimensional information with star-shaped relational models makes it possible to avoid the problem of optimizing the storage of sparse matrices, which is acute for multidimensional DBMS (where sparseness is handled by a special choice of schema). Although a whole record is used to store each cell of the fact table (in addition to the values themselves, it includes secondary keys - references to the dimension tables), non-existent values can simply be left out of the fact table, that is, the presence of empty cells in the database is excluded. Indexing provides acceptable speed of access to the data in the fact tables.

4. Data mining

The sphere of patterns differs from the two previous ones in that in it the accumulated information is automatically generalized into statements that can be characterized as knowledge. This process is extremely relevant for users now, and its importance will only grow with time, since, according to the law cited in [ ], "the amount of information in the world doubles every 20 months", while "the computer technologies that promised a fountain of wisdom so far only regulate the flows of data".

In most publications data mining is defined aphoristically: "extracting grains of knowledge from mountains of data", "mining data by analogy with mining minerals". In English there are two terms that are both rendered as IAD (intelligent data analysis): Knowledge Discovery in Databases (KDD) and Data Mining (DM). In most works they are used as synonyms [see, for example, ], although some authors [ , ] regard KDD as the broader concept - a scientific direction formed "at the intersection of artificial intelligence, statistics and database theory" that covers the whole process of extracting information from data and using it - and DM as the set of inductive methods of this process, that is, what is defined below as the free search stage of IAD.

Let us settle on the following definition: IAD is a decision support process based on the search for hidden patterns (information patterns) in data [ , ]. It should be noted that most IAD methods were originally developed within the framework of artificial intelligence (AI) theory in the 1970s and 1980s, but became widespread only in recent years, when the problem of intellectualizing the processing of large and rapidly growing volumes of corporate data required their use as add-ons over data warehouses.

4.2.2. Predictive Modeling

Here, at the second stage of IAD, the results of the first stage are put to use: the patterns found in the database serve to predict unknown values:

  • when classifying a new object, we can, with a certain degree of confidence, assign it to a particular group on the basis of the known values of its attributes;
  • when predicting a dynamic process, the results of determining the trend and periodic fluctuations can be used to make assumptions about the likely development of some dynamic process in the future.

Returning to the examples considered above, let us continue them at this stage. Knowing that a certain Ivanov is a programmer, one can be 61% sure that his age is ...

It should be noted that free search reveals general regularities, i.e. it is inductive, whereas any forecast makes guesses about the values of specific unknown quantities and is therefore deductive. In addition, the resulting constructions may be either transparent, i.e. admitting reasonable interpretation (as in the example with the generated logical rules), or non-interpretable "black boxes" (for example, nobody knows exactly why a constructed and trained neural network works the way it does).

4.2.3. Exception Analysis (Forensic Analysis)

The subject of this analysis is the anomalies in the discovered patterns, that is, unexplained exceptions. To find them, one must first determine the norm (the free search stage) and then identify its violations. Thus, having determined that 84% of general education schools are municipally owned, one may ask what falls into the 16% that constitute an exception to this rule. Perhaps a logical explanation will be found for them, which can also be framed as a pattern. But it may also turn out that we are dealing with errors in the source data, in which case exception analysis can be used as a tool for cleansing the information in the data warehouse.

4.3. Classification of technological methods of IAD

All IAD methods are divided into two large groups according to the principle of working with the initial training data.

  1. In the first case, the source data can be stored in explicitly detailed form and used directly for predictive modeling and/or exception analysis; these are the so-called methods of case-based (precedent-based) reasoning. The main problem with this group of methods is that they are difficult to apply to large volumes of data, although it is precisely in the analysis of large data warehouses that IAD methods are most useful.
  2. In the second case, the information is first extracted from the primary data and converted into some formal constructions (their form depends on the specific method). According to the previous classification, this stage is performed at the stage of free search, which is basically absent in the methods of the first group. Thus, for predictive modeling and exception analysis, the results of this stage are used, which are much more compact than the initial data arrays themselves. In this case, the resulting constructions can be either "transparent" (interpretable) or "black boxes" (uninterpretable).

These two groups and the methods included in them are shown in Fig. 4.


Fig. 4. Classification of technological methods of IAD.

4.3.1. Direct use of training data

The generalized Lazy-Learning algorithm of the group under consideration looks as follows (the description of the algorithm is taken from [ ]). An example $x$ is given as input to the classifier, and a prediction of its class is expected as output. Each example is represented by a point in the multidimensional space of properties (attributes) and belongs to some class $c$ from a fixed set of classes $C$. Each attribute takes either continuous values or discrete values from a fixed set. For the example $x$, its most likely class is returned.

An individual feature of the k-nearest neighbor algorithm is the way it determines the a posteriori probability that the example $x$ belongs to the class $c$:

$$p(c \mid x) = \frac{\sum_{y \in N_k(x)} \delta\bigl(c, \mathrm{class}(y)\bigr)\,\mathrm{Close}(x, y)}{\sum_{y \in N_k(x)} \mathrm{Close}(x, y)},$$

where $\delta(a, b)$ returns 1 when the arguments are equal and 0 otherwise, and $\mathrm{Close}(x, y)$ is the proximity function, defined as

$$\mathrm{Close}(x, y) = \frac{1}{\mathrm{dist}(x, y)},$$

and $N_k(x)$ is the set of the k nearest neighbors of $x$ among the known training examples, whose proximity to the classified example $x$ is determined by the distance function $\mathrm{dist}$. The k-nearest neighbor method calculates the distance from $x$ to each example $y$ by the formula

$$\mathrm{dist}(x, y) = \left(\sum_{f} w(f)\,\mathrm{difference}(x_f, y_f)^{\,r}\right)^{1/r},$$

where most often $r = 2$ is taken (Euclidean space), and the function $\mathrm{difference}(x_f, y_f)$, depending on the type of the attribute $f$, is defined as the absolute difference of the values for numeric attributes and as 0 (the values coincide) or 1 (the values differ) for discrete attributes; $w(f)$ is the weight function of the attribute $f$. In the pure k-nearest neighbor algorithm

$$w(f) = 1,$$

that is, this function is considered a constant.

The nearest neighbor method is a special case of the k-nearest neighbor method with k = 1. More complex algorithms of the Lazy-Learning type rely on the same generalized algorithm [ , , ], but either determine the a posteriori probabilities of class membership differently or (like, for example, the Nested Generalized Exemplars algorithm) complicate the calculation of the weight function w(f).
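A minimal sketch of the scheme described above (plain Python; the training examples, the value of k and the handling of zero distances are invented for the example, and normalization of numeric attributes is omitted for brevity):

```python
from collections import defaultdict

def difference(a, b):
    # Numeric attributes: absolute difference; discrete ones: 0/1 mismatch.
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b)
    return 0.0 if a == b else 1.0

def dist(x, y, w, r=2):
    return sum(w[f] * difference(x[f], y[f]) ** r for f in x) ** (1.0 / r)

def knn_posteriors(x, training, k=3, w=None):
    w = w or defaultdict(lambda: 1.0)                 # pure k-NN: w(f) = 1
    neighbours = sorted(training, key=lambda ex: dist(x, ex[0], w))[:k]
    votes, norm = defaultdict(float), 0.0
    for example, cls in neighbours:
        closeness = 1.0 / (dist(x, example, w) or 1e-9)   # Close = 1 / dist
        votes[cls] += closeness
        norm += closeness
    return {cls: v / norm for cls, v in votes.items()}

# Attributes: age (numeric) and occupation (discrete); classes are invented.
training = [
    ({"age": 27, "occupation": "programmer"}, "young"),
    ({"age": 33, "occupation": "programmer"}, "young"),
    ({"age": 52, "occupation": "manager"},    "middle-aged"),
]
print(knn_posteriors({"age": 30, "occupation": "programmer"}, training, k=2))
```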

The peculiarity of this group of methods is that the prediction of unknown values is based on an explicit comparison of the new object (example) with the known examples. When there are many training examples, the sampling technique is sometimes used so as not to scan the whole training set for every new example: a relatively small subset of "typical representatives" of the training examples is selected, and classification is performed by comparison with them. However, this technique should be used with some caution, since significant patterns may not be represented in the selected subset.

As for the most famous representative of this group - the k-nearest neighbor method - it is more suitable for those subject areas where object attributes are predominantly numerical, since the definition of the distance between examples in this case is more natural than for discrete attributes.

4.3.2. Identification and use of formalized patterns

The methods of this group extract common dependencies from a set of data and then allow them to be applied in practice. They differ from each other:

  • by types of retrieved information (which are determined by the task being solved - see the classification of IAD tasks above);
  • according to the way of presenting the found regularities.

The formalism chosen to express the regularities allows us to distinguish three different approaches, each of which has its roots in the corresponding branches of mathematics:

  • cross-tab methods;
  • methods of logical induction;
  • methods for deriving equations.

The methods of logical induction are the most versatile in the sense that they can work with both numeric and other types of attributes. The construction of equations requires all attributes to be reduced to numeric form, whereas cross-tabulation, on the contrary, requires every numeric attribute to be transformed into a discrete set of intervals.

Cross tab methods

Cross tabulation is a simple form of analysis widely used in online analytical processing (OLAP) reporting. A two-dimensional crosstab is a matrix of values, each cell of which lies at the intersection of attribute values. The extension of the idea of ​​a cross-tab representation to the case of a hypercubic information model is, as already mentioned, the basis of multidimensional data analysis, therefore this group of methods can be considered as a symbiosis of multidimensional operational analysis and data mining.

Cross-tab visualization is the simplest embodiment of the idea of searching for information in data by the cross-tabulation method. Strictly speaking, this method does not quite fit the noted property of IAD - the transfer of initiative to the system at the free search stage. In fact, cross-tab visualization is part of OLAP functionality: the system only provides a matrix of indicators in which the analyst may himself notice a pattern. But the very provision of such a cross-tab is aimed at searching for "patterns of information" in the data for decision support, that is, it satisfies the definition of IAD given above. It is therefore no coincidence that many authors [ , , ] still classify cross-tab visualization as an IAD method; a minimal sketch of building such a cross-tab is given below.
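A minimal sketch of building such a matrix of indicators (plain Python; the records are invented for the example):

```python
from collections import Counter

# Records with two discrete attributes (invented data).
records = [
    ("programmer", "repaid"), ("programmer", "repaid"),
    ("manager", "repaid"), ("manager", "not repaid"),
    ("driver", "not repaid"),
]

cross = Counter(records)                       # cell = count at the intersection
rows = sorted({r for r, _ in records})
cols = sorted({c for _, c in records})

print("".ljust(12) + "".join(c.ljust(12) for c in cols))
for r in rows:
    print(r.ljust(12) + "".join(str(cross[(r, c)]).ljust(12) for c in cols))
```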

The IAD methods of the cross-tabulation group also include Bayesian Networks, which use Bayes' theorem of probability theory to determine the a posteriori probabilities of hypotheses forming a complete group of pairwise incompatible events from their a priori probabilities:

$$P(H_i \mid E) = \frac{P(E \mid H_i)\,P(H_i)}{\sum_{j} P(E \mid H_j)\,P(H_j)}.$$

Bayesian networks have been actively used to formalize expert knowledge in expert systems, but have recently been used in IAD to extract knowledge from data.
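A direct computation by this formula can be sketched as follows (plain Python; the hypotheses, the evidence E and all probabilities are invented for the example):

```python
# Prior probabilities of a complete group of pairwise incompatible hypotheses
# and the likelihoods of the observed evidence E under each of them (invented).
priors      = {"repays on time": 0.7, "pays late": 0.2, "defaults": 0.1}
likelihoods = {"repays on time": 0.1, "pays late": 0.4, "defaults": 0.8}  # P(E | H)

evidence = sum(likelihoods[h] * priors[h] for h in priors)      # P(E)
posteriors = {h: likelihoods[h] * priors[h] / evidence for h in priors}
print(posteriors)   # the posteriors sum to 1 over the complete group of hypotheses
```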

After pruning, the different terminal nodes of a tree end up at different levels, that is, the paths to them include different numbers of attribute-value tests; in other words, many attributes are not considered at all on the way to terminal nodes lying at the upper levels of the tree. Therefore, when building decision trees, the order in which attributes are checked at decision nodes is of great importance.

The strategy used in decision tree induction algorithms is called the divide-and-conquer strategy, as opposed to the separate-and-conquer strategy on which a large number of rule induction algorithms are built. Quinlan described a divide-and-conquer algorithm of the following form.

A = {A_1, ..., A_m} is the set of attributes;
V(A_j) is the set of possible values of the attribute A_j (thus, for building decision trees the domains of continuous attributes must also be divided into a finite set of intervals).

Quinlan proposed to calculate the E-score as follows. Let, for the current node and a candidate attribute A with values a_i:

p - the number of positive examples;
n - the number of negative examples;
p_i - the number of positive examples with the value a_i of the attribute A;
n_i - the number of negative examples with the value a_i of the attribute A.

Then the E-score of the attribute A is

$$E(A) = \sum_{i} \frac{p_i + n_i}{p + n}\, I(p_i, n_i), \qquad I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

(with $0 \log_2 0$ taken to be 0).

The E-score is an information-theoretic measure based on entropy. It expresses the degree of uncertainty in the classification that remains when the attribute in question is used in the decision node, so the attribute with the lowest E-score is considered to have the greatest classifying power. However, the E-score defined in this way also has drawbacks: in particular, it favors attributes with a large number of values during tree construction. For this reason some works [ , ] propose modifications of the E-score that eliminate these shortcomings.
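The E-score from the formula above can be computed directly (plain Python; the training examples and the attribute are invented). The lower the value, the less uncertainty remains after branching on the attribute:

```python
import math

def entropy(p, n):
    # I(p, n): uncertainty of a node containing p positive and n negative examples.
    total = p + n
    result = 0.0
    for x in (p, n):
        if x:
            result -= (x / total) * math.log2(x / total)
    return result

def e_score(examples, attr):
    # examples: list of (attribute_values_dict, is_positive) pairs.
    p = sum(1 for v, pos in examples if pos)
    n = len(examples) - p
    values = {v[attr] for v, _ in examples}
    score = 0.0
    for a in values:
        p_i = sum(1 for v, pos in examples if v[attr] == a and pos)
        n_i = sum(1 for v, pos in examples if v[attr] == a and not pos)
        score += (p_i + n_i) / (p + n) * entropy(p_i, n_i)
    return score

examples = [
    ({"income": "high"}, True), ({"income": "high"}, True),
    ({"income": "low"},  False), ({"income": "low"},  True),
]
print(e_score(examples, "income"))   # 0.5: the "low" branch is still mixed
```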

Pruning a decision tree in order to improve the accuracy of prediction when classifying new examples is usually performed on the fully constructed tree, that is, as a post-simplification procedure. Moving from the bottom upwards, decision nodes together with their subtrees are replaced by terminal nodes as long as this improves the given heuristic measure.

Rule induction

The popularity of decision trees stems from the speed of their construction and the ease of their use in classification. Moreover, a decision tree can easily be converted into a set of symbolic rules by generating one rule from each path from the root to a terminal node. However, the rules in such a set are non-overlapping, because in a decision tree each example can be assigned to one and only one terminal node. A more general (and more realistic) case is a theory consisting of a set of non-hierarchical, overlapping symbolic rules. A significant part of the algorithms that induce such rule sets follow the separate-and-conquer, or covering, strategy, which originates in the works of R. Michalski [ , ]. The term "separate-and-conquer" was coined by Pagallo and Haussler, who characterized this induction strategy as follows:

  • produce a rule that covers part of the training set;
  • remove the examples covered by the rule from the training set (separate);
  • sequentially learn other rules covering groups of the remaining examples (conquer) until all examples have been explained.

Fig. 5 shows the general separate-and-conquer rule induction algorithm. Different implementations of the subroutines called in the general algorithm account for the variety of known separate-and-conquer methods.


Fig. 5. General separate-and-conquer algorithm for rule induction.

The SEPARATEANDCONQUER algorithm starts with an empty theory. While there are positive examples in the training set, the FINDBESTRULE subroutine is called to find a rule that covers part of the positive examples. All covered examples are then separated from the training set, the generated rule is included in the theory, and the next rule is searched for on the remaining examples. Rules are generated until no positive examples remain or until the RULESTOPPINGCRITERION stopping criterion is met. Often the resulting theory is then post-processed by POSTPROCESS.

The FINDBESTRULE procedure searches the hypothesis space for a rule that optimizes the quality criterion described in EVALUATERULE. As a rule, the value of this heuristic function is higher the more positive and the fewer negative examples the candidate rule covers. FINDBESTRULE processes Rules, an ordered list of candidate rules generated by the INITIALIZERULE procedure.

New rules are always inserted at the right places (INSERTSORT), so that Rules remains a list ordered in descending order of the heuristic evaluations of the rules. In each cycle, SELECTCANDIDATES selects a subset of candidate rules, which is then refined by REFINERULE. Each refinement result is evaluated and inserted into the sorted list Rules, unless STOPPINGCRITERION prevents this. If the score of NewRule is better than that of the best previously found rule, NewRule is assigned to the BestRule variable. FILTERRULES selects the subset of the ordered rule list to be used in further iterations. When all candidate rules have been processed, the best rule is returned.
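A simplified Python sketch of the outer loop only; the uppercase subroutine names above correspond to the function parameters here, and the FINDBESTRULE search is passed in as a black box (an assumption made to keep the sketch short):

    def separate_and_conquer(examples, find_best_rule, rule_stopping_criterion,
                             post_process=lambda theory: theory):
        # `examples` is a list of (example, is_positive) pairs; `find_best_rule`
        # plays the role of FINDBESTRULE and returns a predicate example -> bool.
        theory = []                                     # start with an empty theory
        while any(is_positive for _, is_positive in examples):
            rule = find_best_rule(examples)
            if rule_stopping_criterion(theory, rule, examples):
                break                                   # RULESTOPPINGCRITERION
            # Separate: remove the examples covered by the new rule.
            examples = [(x, pos) for x, pos in examples if not rule(x)]
            theory.append(rule)                         # conquer: extend the theory
        return post_process(theory)                     # POSTPROCESS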

The main problem facing rule induction algorithms is avoiding overfitting when working with noisy data. Overfitting avoidance in separate-and-conquer algorithms can handle noise in the following ways:

Comparing the capabilities of decision trees and rule induction

Rule induction and decision trees, being two ways of solving the same problem, differ significantly in their capabilities. Despite the widespread use of decision trees, rule induction appears, for a number of reasons noted in [ , , ], to be the preferable approach.

On the other hand, rule induction is carried out by considerably more complex (and slower) algorithms than decision tree induction. Particular difficulties arise with post-simplification of the constructed theory, in contrast to the simplicity of pruning decision trees, as Furnkranz noted: pruning branches of a decision tree never affects neighboring branches, whereas pruning the conditions of a rule affects all rules that overlap with it (Fig. 6).


Fig. 6. Post-simplification in learning algorithms:
(a) divide-and-conquer; (b) separate-and-conquer.

Fig. 6(a) illustrates post-simplification in decision tree induction. The right half of the overcomplicated tree is covered by the sets C and D of training examples. When the simplifying algorithm decides to prune these two terminal nodes, the node that generated them becomes a terminal node, which is now covered by the examples of both C and D. The left branch of the decision tree is not affected by this operation.

On the other hand, cutting conditions from a rule means generalizing it, that is, in its new form it covers more positive and more negative examples. These additional positive and negative examples must therefore be excluded from the training set so as not to affect the induction of subsequent rules. In the case of Fig. 6(b), the first of the three rules is simplified so that it covers not only the examples covered by its original version, but also all the examples covered by the third rule, as well as some of the examples covered by the second rule. While the third rule can then simply be removed by the post-simplification algorithm, the situation with the remaining set of examples B2 is not so simple. The second rule naturally covers all the examples of B2, because it was produced to cover the examples of the set B that contains B2. However, it may well be that another rule is more suitable for separating the positive examples of B2 from the remaining negative examples. Correct handling of such situations requires close integration of the pre-simplification and post-simplification processes, which significantly complicates the rule induction algorithm and degrades its performance.

Based on this comparison we can conclude that building decision trees is justified in simple problems with small amounts of initial data, owing to the simplicity and speed of their induction. When analyzing large volumes of data accumulated in warehouses, however, rule induction methods are preferable despite their relative complexity.

Equation derivation methods

Equation derivation methods attempt to express the patterns hidden in data as mathematical expressions. They can therefore only work with attributes of numeric type, while attributes of other types must be artificially encoded with numeric values. This causes several problems that limit the use of these methods in practice. Nevertheless, they are widely used in many applications.

Statistics

Classical methods of statistical analysis are used in IAD tools most often to solve the problem of forecasting.

  1. Identification of trends in time series. The trend of the average level can be represented as a graph or as an analytical function around whose values the actual levels of the process under study fluctuate. Average-level trends are often called the deterministic component of the process, and the corresponding time series is expressed by the equation y(t) = f(t) + e(t), where y(t) is the level of the series at time t, f(t) is the deterministic component, and e(t) is the random component. The deterministic component is usually represented by a fairly simple analytical function - linear, parabolic, hyperbolic, exponential - whose parameters are chosen from the historical data so as to approximate them as closely as possible.
  2. Harmonic analysis. In many cases, smoothing a time series by trend detection does not give satisfactory results, because autocorrelation is observed in the residuals. The reason for the autocorrelation of the residuals may be noticeable periodic fluctuations of the series around the selected trend, which are often encountered in time series. In such cases one should resort to harmonic analysis, that is, to isolating the periodic component of the series. Based on the results of separating the trend and the periodic component, a statistical forecast of the process can be made by extrapolation, on the assumption that the trend and fluctuation parameters will persist over the forecast period [ , p. 304].
  3. Correlation-regression analysis. Unlike a functional (rigidly determined) relationship, a statistical (stochastically determined) relationship between variables occurs when, as the value of one of them changes, the other can take any value within certain limits with certain probabilities, while its average value or other statistical characteristics change according to a certain law [ , pp. 191-192]. A special case of a statistical relationship, in which different values of one variable correspond to different average values of the other, is correlation. In accordance with the essence of the correlation relationship, its study has two goals:
    1) measuring the parameters of the equation that expresses the relationship of the average values of the dependent variable to the values of the independent variable (the dependence of the average values of the resulting attribute on the values of the factor attributes);
    2) measuring the closeness of the relationship between the attributes [ , pp. 195-196].
    The method of correlation-regression analysis is well studied [ , 19, 29] and widely used in practice. However, it has a number of limitations:
    1) to ensure sufficient accuracy and reliability, the number of observations must be tens or hundreds of times greater than the number of factors, so that the law of large numbers, acting in full force, ensures the effective mutual cancellation of random deviations from the regular nature of the relationship between features;
    2) for a reliable expression of the regularity in terms of the average value, sufficient qualitative homogeneity of the population is required so that the correlation parameters are not distorted; in addition, the requirement that the distribution of the population over the resulting and factor attributes follow the normal probability law is sometimes put forward as a condition of correlation analysis (this condition is associated with the use of the least squares method in calculating the correlation parameters: only under a normal distribution does it give parameter estimates that meet the maximum likelihood principle), although in practice the least squares method gives good results even when this prerequisite is fulfilled only approximately [ , p. 14];
    3) the method of correlation-regression analysis cannot explain the role of the factor attributes in creating the resulting attribute [ , p. 198];
    4) correlation indicators should be interpreted only in terms of variation of the resulting and factor attributes; if the task is to measure the relationship between changes in the characteristics of an object over time, the method of correlation-regression analysis requires significant changes (the correlation of time series must be studied) [ ; , pp. 307-313].
    The correlation-regression models (CRM) obtained as a result of the analysis are usually quite well interpretable and can be used in predictive modeling. But, as noted in [ ], it is impossible to apply this type of analysis without deep knowledge of statistics. The theoretical training of the analyst plays a particularly important role here, which is why few existing IAD tools offer correlation-regression analysis as one of their data processing tools.
  4. Correlation of time series. The problem of studying causal relationships over time is very complex, and a complete solution to all the problems of such a study has not yet been developed [ , p. 307]. The main difficulty is that when a trend is present over a sufficiently long period, most of the sum of squared deviations is associated with the trend; at the same time, if two attributes have trends with the same direction of change of levels, this does not at all imply a causal relationship. Therefore, to obtain real correlation indicators it is necessary to abstract from the distorting influence of the trends: to calculate deviations from the trends and measure the correlation of the fluctuations (a detailed treatment of this approach is given in [ ]). However, the conclusions about the closeness of the relationship between the fluctuations cannot always be transferred to the relationship of the series as a whole (according to the example given in [ , p. ], the effect of fertilizers cannot be reduced only to the correlation of fluctuations). A minimal numerical sketch of correlating deviations from fitted trends is given after this list.
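A minimal sketch of this approach on synthetic data (the two series, their lengths and the linear form of the trends are assumptions made purely for illustration):

    import numpy as np

    t = np.arange(20)
    rng = np.random.default_rng(0)
    y1 = 2.0 * t + 5 + rng.normal(0, 1.5, t.size)   # first indicator over 20 periods
    y2 = 1.5 * t + 3 + rng.normal(0, 1.5, t.size)   # second indicator with a similar trend

    def detrend(y, t):
        # Fit a linear trend f(t) = a*t + b by least squares and return the
        # fluctuations e(t) = y(t) - f(t) around it.
        a, b = np.polyfit(t, y, deg=1)
        return y - (a * t + b)

    # The correlation of the raw levels is dominated by the common trend ...
    r_levels = np.corrcoef(y1, y2)[0, 1]
    # ... while the correlation of the fluctuations abstracts from the trends.
    r_fluct = np.corrcoef(detrend(y1, t), detrend(y2, t))[0, 1]
    print(f"levels: {r_levels:.2f}, fluctuations: {r_fluct:.2f}")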

Neural networks

Artificial neural networks as a means of information processing were modeled by analogy with known principles of the functioning of biological neural networks. Their structure is based on the following assumptions [ , p. 3]:

  • information processing is carried out in a set of simple elements - neurons;
  • signals between neurons are transmitted through connections from outputs to inputs;
  • each link is characterized by a weight by which the signal transmitted over it is multiplied;
  • each neuron has an activation function (usually non-linear), whose argument is calculated as the sum of the weighted input signals, and the result is considered the output signal.

Thus, a neural network is a set of connected nodes, each of which has inputs, an output and a (usually non-linear) activation function (Fig. 7). Neural networks can learn from a known set of examples - the training set. A trained neural network is a "black box" (an uninterpretable or very hard to interpret predictive model) that can be applied to classification, clustering and forecasting problems.


Fig. 7. Neuron with activation function F.
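A minimal sketch of such a neuron, with a logistic sigmoid assumed as the activation function F (any other non-linear function could be substituted):

    import math

    def neuron_output(inputs, weights, bias,
                      activation=lambda s: 1 / (1 + math.exp(-s))):
        # Weighted sum of the input signals plus bias, passed through the
        # (usually non-linear) activation function.
        s = sum(w * x for w, x in zip(weights, inputs)) + bias
        return activation(s)

    # Example: a neuron with three inputs and illustrative weights.
    print(neuron_output([0.5, -1.0, 2.0], [0.8, 0.2, -0.5], bias=0.1))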

Training a neural network consists in adjusting the weight coefficients that connect the outputs of some neurons with the inputs of others. Network training can follow one of two basic scenarios: learning "with a teacher" (supervised) or "without a teacher" (unsupervised).

Most often, IAD tools use a special type of neural network trained "with a teacher" - the multilayer perceptron [ , pp. 54-55]. Fig. 8 shows such a neural network with two layers of neurons, three input variables and three output variables (in the general case, the number of inputs, the number of outputs, the number of layers and the number of neurons in each inner layer can be arbitrary). The output of each neuron of the previous layer is connected to an input of each neuron of the next layer.


Fig. 8. Multilayer perceptron trained by the error backpropagation procedure.

The weights of the multilayer perceptron are adjusted by the error backpropagation algorithm [ , pp. 56-69]. During training it is assumed that for each input vector (set of inputs) there is a paired target vector (set of outputs); together they form a training pair (example). Before training begins, all weights must be assigned small random initial values to prevent pathological cases in which training fails. The entire set of training pairs constitutes the training set. Training the network requires the following operations:

  1. choose a training pair from the training set;
  2. apply the input vector of the training pair to the input of the network;
  3. calculate the output of the network;
  4. calculate the difference between the network output and the target vector of the training pair;
  5. adjust the network weights to minimize the error;
  6. repeat steps 1-5 for each pair of the training set until the error on the entire set reaches an acceptable level.

Backpropagation training is carried out layer by layer, starting from the output layer, in steps 4 and 5.
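A minimal sketch of this training cycle for a two-layer perceptron with sigmoid activations (the task, the network size and the learning rate are illustrative assumptions, not parameters fixed by the algorithm):

    import numpy as np

    def sigmoid(s):
        return 1 / (1 + np.exp(-s))

    rng = np.random.default_rng(1)

    # Training set: input vectors X and paired target vectors Y (here, XOR).
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    Y = np.array([[0], [1], [1], [0]], dtype=float)

    # Small random initial weights, as required before training starts.
    W1, b1 = rng.normal(0, 0.5, (2, 4)), np.zeros(4)    # input -> hidden layer
    W2, b2 = rng.normal(0, 0.5, (4, 1)), np.zeros(1)    # hidden -> output layer
    lr = 0.5

    for epoch in range(5000):
        for x, y in zip(X, Y):                  # steps 1-2: choose a pair, apply the input
            h = sigmoid(x @ W1 + b1)            # step 3: hidden layer output ...
            out = sigmoid(h @ W2 + b2)          # ... and network output
            err = out - y                       # step 4: difference from the target
            # Step 5: adjust the weights layer by layer, starting from the output layer.
            delta_out = err * out * (1 - out)
            delta_hid = (delta_out @ W2.T) * h * (1 - h)
            W2 -= lr * np.outer(h, delta_out)
            b2 -= lr * delta_out
            W1 -= lr * np.outer(x, delta_hid)
            b1 -= lr * delta_hid

    print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))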

Being "universal approximators", perceptrons can learn quite complex patterns, in contrast to regression models, in which the type of approximating function is selected from a limited possible set. But this flexibility also has a downside - the number of degrees of freedom of the predictive model created often exceeds the number of examples used for training. This means that the neural network can "learn" even from an array of generated random numbers. And indeed, as shown by the use of a neural network to solve the test problem of stock market analysis given in , it perfectly explains all market fluctuations in the past, but does not give a reasonable forecast for the future. Improving the predictive accuracy of the trained network can be achieved by using only some part of the training set for training the neural network, while the rest of the examples are used to test the adequacy of the created model on unknown data; at the same time, one should try to train the network with the least complex configuration possible to reduce the number of degrees of freedom.

There are also a number of other disadvantages that limit the use of neural networks as an IAD tool.

The main problem in training neural networks is synthesizing a network structure capable of learning from a given training set. There is no guarantee that training a network of a given structure will not stop before an acceptable error threshold is reached, or that it will not get stuck in a local minimum. Although multilayer networks are widely used for classification and function approximation, their structural parameters still have to be determined by trial and error. According to the conclusions in [ ], existing theoretical results provide only weak guidelines for choosing these parameters in practical applications.

Thus, neural networks, a fairly powerful and flexible IAD tool, should be used with some caution and are not suitable for all problems requiring corporate data mining.

4.3.3. Conclusions

As can be seen from the review, none of the considered methods covers all the tasks of supporting managerial decision-making on the basis of intelligent analysis of the contents of data warehouses. Most mining systems on the market, however, implement only one to three methods (for example, Pilot Discovery Server from Pilot Software Inc. and Information Harvester from Information Harvester Corp. - only decision trees; Idis from Information Discovery Inc. - decision trees and rule induction; Darwin by Thinking Machines - neural networks, decision trees and data visualization; MineSet by Silicon Graphics - decision trees, association rule induction and data visualization), so in real applications, in order not to miss a large number of significant patterns, several different tools usually have to be used. In addition, many tools cannot work with data warehouses directly and require preliminary preparation of the initial data for analysis as flat files of fixed structure, which also complicates their practical use.

5. Complementarity of OLAP and IAD

Online analytical processing and data mining are two components of the decision support process. But today most OLAP systems focus only on providing access to multidimensional data, and most IAD tools that work in the realm of patterns deal with one-dimensional perspectives of the data. These two types of analysis should be tightly coupled, that is, OLAP systems should focus not only on access but also on the search for patterns.


Fig. 9. Architecture of a multidimensional data mining system.

The ideal goal of building a corporate information and analytical system is to create a closed-loop DSS. As N. Raden noted, "many companies have created ... excellent data warehouses, ideally sorting through the mountains of unused information, which in itself does not provide either a quick or sufficiently competent response to market events" [ , p. 39]. In particularly dynamic areas (such as retail), where the situation changes daily, timely and competent decision making is not ensured even with the use of conventional OLAP and IAD tools. These tools must be combined with each other and have feedback to the original data processing systems, so that the results of the DSS's work are immediately transmitted as control actions to the operational systems. Wal-Mart, the largest US retail company, is developing such a closed-loop DSS.

4) headings in the "predicate" part are simple when its columns have no common content, or complex when content common to several columns is detailed in each of them; in the latter case the heading is arranged in several tiers, for example:

To enhance the clarity of the analysis materials, graphic methods are often used. For example, tables recording the dynamics of indicators are accompanied by figures in which this dynamics is presented as curves or bars. The structure of generalizing indicators in the columns of analytical tables is illustrated with pie charts. Other forms of diagrams are also used.
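A minimal matplotlib sketch of both techniques (the indicator values and the category shares are hypothetical):

    import matplotlib.pyplot as plt

    years = [2019, 2020, 2021, 2022]
    revenue = [120, 135, 150, 180]                  # hypothetical indicator dynamics
    structure = {"Product A": 45, "Product B": 30, "Product C": 25}   # % shares

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.bar(years, revenue)                         # dynamics as bars (or ax1.plot for a curve)
    ax1.set_title("Indicator dynamics")
    ax2.pie(structure.values(), labels=structure.keys(), autopct="%1.0f%%")
    ax2.set_title("Structure of a generalizing indicator")
    plt.tight_layout()
    plt.show()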
