DESIGN AND IMPLEMENTATION OF COMPUTERIZED DATA MINING SYSTEM
Demography is the study of human populations. It comprises a set of methods for measuring the dimensions and dynamics of populations. Although these methods were developed primarily to study human populations, they extend to any area where researchers want to know how populations of social actors change over time through processes of birth, death, and migration. In the context of human populations, demographic analysis uses administrative records to develop an independent estimate of the population. Presently, the National Population Commission (NPC) relies on manual operations in its statistical study of the human population, and these manual methods are prone to numerous errors. There is therefore a need for an automated data mining system that reduces these errors to the barest minimum. The system's activity procedures were analyzed using Gane and Sarson's approach.
1.0 Background of the Study
Progress in digital data acquisition and storage technology has resulted in the growth of huge databases. This has occurred in all areas of human endeavor, from the mundane (such as supermarket transaction data, credit card usage records, telephone call details, and government statistics) to the more exotic (such as images of astronomical bodies, molecular databases, and medical records). Little wonder, then, that interest has grown in the possibility of tapping these data, of extracting from them information that might be of value to the owner of the database. The discipline concerned with this task has become known as data mining.
Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner (Efraim et al., 2010). The relationships and summaries derived through a data mining exercise are often referred to as models or patterns. Examples include linear equations, rules, clusters, graphs, tree structures, and recurrent patterns in time series.
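As a concrete illustration of one such pattern, the sketch below derives clusters from a small set of two-dimensional points using a minimal k-means procedure written in plain Python. The data values, function name, and initialization strategy are all illustrative assumptions, not part of any real NPC system.

```python
# Minimal k-means clustering sketch on toy data (all values hypothetical).
# Shows how a "cluster" pattern can be derived from observational data.

def kmeans(points, k, iterations=10):
    """Partition 2-D points into k clusters by iterative centroid refinement."""
    centroids = points[:k]  # naive initialization: first k points as centroids
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid (squared distance)
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                            + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # recompute each centroid as the mean of its cluster
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# toy data: two visually separated groups of points
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(data, k=2)
```

On this toy data the procedure separates the two groups of three points each; real demographic data would of course have many more dimensions and records.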
The definition above refers to "observational data," as opposed to "experimental data." Data mining typically deals with data that have already been collected for some purpose other than the data mining analysis (for example, they may have been collected in order to maintain an up-to-date record of all the transactions in a bank). This means that the objectives of the data mining exercise play no role in the data collection strategy. This is one way in which data mining differs from much of statistics, in which data are often collected by using efficient strategies to answer specific questions. For this reason, data mining is often referred to as "secondary" data analysis. The definition also mentions that the data sets examined in data mining are often large. If only small data sets were involved, we would simply be dealing with classical exploratory data analysis as practiced by statisticians. When faced with large bodies of data, new problems arise.
Some of these relate to housekeeping issues of how to store or access the data, but others relate to more fundamental issues, such as how to determine the representativeness of the data, how to analyze the data in a reasonable period of time, and how to decide whether an apparent relationship is merely a chance occurrence not reflecting any underlying reality. Often the available data comprise only a sample from the complete population (or, perhaps, from a hypothetical super population); the aim may be to generalize from the sample to the population. For example, we might wish to predict how future customers are likely to behave or to determine the properties of protein structures that we have not yet seen. Such generalizations may not be achievable through standard statistical approaches because often the data are not (classical statistical) "random samples," but rather "convenience" or "opportunity" samples. Sometimes we may want to summarize or compress a very large data set in such a way that the result is more comprehensible, without any notion of generalization. This issue would arise, for example, if we had complete census data for a particular country or a database recording millions of individual retail transactions.
Data mining is often set in the broader context of knowledge discovery in databases, or KDD. This term originated in the artificial intelligence (AI) research field. The KDD process involves several stages: selecting the target data, preprocessing the data, transforming them if necessary, performing data mining to extract patterns and relationships, and then interpreting and assessing the discovered structures. Once again the precise boundaries of the data mining part of the process are not easy to state; for example, to many people data transformation is an intrinsic part of data mining. In this text we will focus primarily on data mining algorithms rather than the overall process. For example, we will not spend much time discussing data preprocessing issues such as data cleaning, data verification, and defining variables. Instead we focus on the basic principles for modeling data and for constructing algorithmic processes to fit these models to data.
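The KDD stages listed above can be sketched as a small pipeline. The records, field names, and functions below are hypothetical examples chosen for illustration; they do not describe any actual NPC data layout.

```python
# A sketch of the KDD stages described above, applied to hypothetical
# demographic records. All field names and values are illustrative.

records = [
    {"age": 34, "region": "Enugu", "status": "alive"},
    {"age": -1, "region": "Enugu", "status": "alive"},   # invalid entry
    {"age": 61, "region": "Nsukka", "status": "alive"},
    {"age": 25, "region": "Enugu", "status": "alive"},
]

def select_target(data, region):
    """Stage 1: select the target data (here, records for one region)."""
    return [r for r in data if r["region"] == region]

def preprocess(data):
    """Stage 2: clean the data (drop records with impossible ages)."""
    return [r for r in data if 0 <= r["age"] <= 120]

def transform(data):
    """Stage 3: transform records into the feature needed for mining."""
    return [r["age"] for r in data]

def mine(ages):
    """Stage 4: extract a simple summary pattern (here, the mean age)."""
    return sum(ages) / len(ages)

target = select_target(records, "Enugu")
clean = preprocess(target)
mean_age = mine(transform(clean))
# Stage 5, interpretation, is the human step of assessing what the
# derived figure means for the population under study.
```

Each stage is deliberately trivial here; the point is the shape of the process, in which the mining step proper is only one link in a longer chain.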
1.1 Statement of Problem
The researchers discovered challenges facing the manual method of both storing and retrieving data. The existing system proved defective. Among the problems noticed are the following:
• Difficulties encountered in keeping demographic data/information
• Miscalculation of demographic data/information
• Difficulties in accessing demographic data/information
• Time wasted in searching for a given piece of demographic data/information in packed files
• Time wasted in processing demographic data/information
The need therefore arises for the development of a computerized data mining system for the National Population Commission, Enugu.
1.3 Objectives of the Study
Data mining developed out of the increased use of databases to store data and to answer business questions. Traditional query and report tools have been used to describe and extract what is in a database, but they were found deficient in the areas identified in the statement of the problem. The objectives of the study are to:
• Provide essential information for government decision making
• Ensure easy retrieval and updating of data/information
• Support massive data collection
• Eliminate guesswork in population censuses
• Increase the motivation of census workers
• Ease the work associated with the manual method of analyzing data/information
• Eliminate the errors involved in the manual method of analyzing data/information
• Save time when analyzing data/information
• Keep the National Population Commission office neat and tidy, as much information will no longer be documented on paper but stored in the computer
1.4 Significance of the Study
With the growth in information technology, the study is of value to the following:
• Institutions of higher learning
• JAMB and WAEC offices
• Government sectors
• Banks, etc.
1.5 Scope of the Study
This project work is narrowed to a data mining system for the National Population Commission, Enugu. It also covers the development of a database program to aid the storage of demographic data/information at the National Population Commission, Enugu. It can likewise assist other sectors with data processing, information storage, and management.
Moreover, it will provide a search facility that can be used to locate stored data.
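The kind of search facility intended here can be sketched as a filter over stored records. The record layout and field names below are hypothetical, chosen only to show the mechanism.

```python
# Minimal sketch of a search over stored demographic records.
# Field names and values are illustrative assumptions.

records = [
    {"name": "Ada", "lga": "Enugu North", "age": 29},
    {"name": "Obi", "lga": "Nsukka", "age": 41},
    {"name": "Ngozi", "lga": "Enugu North", "age": 35},
]

def search(data, **criteria):
    """Return all records whose fields match every given criterion."""
    return [r for r in data if all(r.get(k) == v for k, v in criteria.items())]

# find every record from one local government area
matches = search(records, lga="Enugu North")
```

In a full system the records would live in a database and the filter would be expressed as an SQL query, but the matching logic is the same.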
1.6 Limitation of the Study
Owing to the scope of this project work as stated above, it is limited to a computerized data mining system for the National Population Commission, Enugu. It is important to mention that finance and the attitude of the staff were major constraints in the course of fact finding.
1.7 Definition of Terms
Database: A collection of information that is related to a particular subject or purpose.
Information: Data that have been processed, interpreted, and understood by the recipient of the message or report.
Internet: A collection of computer networks that operate to common standards and enable the computers and the programs they run to communicate directly.