
Data Analysis

Data analysis is the process of systematically collecting, cleaning, transforming, describing, modeling, and interpreting data, generally employing statistical techniques. It is an important part of both scientific research and business, where demand has grown in recent years for data-driven decision making. Data analysis techniques are used to gain useful insights from datasets, which can then be used to make operational decisions or guide future research. With the rise of “big data,” the storage of vast quantities of data in large databases and data warehouses, there is an increasing need for data analysis techniques that can generate insights about volumes of data far too large to be processed by tools of limited information-processing capacity.

Data collection

Datasets are collections of information. Generally, data and datasets are themselves collected to help answer questions, make decisions, or otherwise inform reasoning. The rise of information technology has led to the generation of vast amounts of data of many kinds, such as text, pictures, videos, personal information, account data, and metadata, the last of which provide information about other data. It is common for apps and websites to collect data about how their products are used or about the people using their platforms. Consequently, there is vastly more data being collected today than at any other time in human history. A single business may track billions of interactions with millions of consumers at hundreds of locations with thousands of employees and any number of products. Analyzing that volume of data is generally only possible using specialized computational and statistical techniques.

Process

For data to be analyzed, it must first be collected and stored. Raw data must be processed into a format that can be used for analysis and be cleaned so that errors and inconsistencies are minimized. Data can be stored in many ways, but one of the most useful is in a database. A database is a collection of interrelated data organized so that certain records (collections of data related to a single entity) can be retrieved on the basis of various criteria. The most familiar kind of database is the relational database, which stores data in tables with rows that represent records (tuples) and columns that represent fields (attributes). A query is a command that retrieves a subset of the information in the database according to certain criteria. A query may retrieve only records that meet certain criteria, or it may join fields from records across multiple tables by use of a common field.
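These ideas can be illustrated with a minimal sketch using Python’s built-in sqlite3 module; the table names and records here are invented for the example. One query retrieves only the records that meet a criterion, and another joins fields from two tables by way of a common field:

```python
import sqlite3

# An in-memory relational database with two related tables.
# All table names and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)])

# A query that retrieves only records meeting a criterion.
big_orders = conn.execute(
    "SELECT id, total FROM orders WHERE total > 20").fetchall()
print(big_orders)  # [(10, 25.0), (11, 40.0)]

# A query that joins fields across tables by use of a common field
# (orders.customer_id matches customers.id).
joined = conn.execute(
    "SELECT customers.name, orders.total FROM orders "
    "JOIN customers ON customers.id = orders.customer_id").fetchall()
print(joined)
```

Each row in a table corresponds to a record (tuple), each column to a field (attribute), and the JOIN clause shows how a shared field links records across tables.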

Frequently, data from many sources is collected into large archives of data called data warehouses. The process of moving data from its original sources (such as databases) to a centralized location (generally a data warehouse) is called ETL (which stands for extract, transform, and load).

In the extraction step, the desired data are identified and copied or exported from their source, such as by running a database query to retrieve the desired records.

The transformation step is the process of cleaning the data so that they fit the analytical need for the data and the schema of the data warehouse. This may involve changing formats for certain fields, removing duplicate records, or renaming fields, among other processes.

Finally, the clean data are loaded into the data warehouse, where they may join vast amounts of historical data and data from other sources.
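The three ETL steps above can be sketched in a few lines of Python using only standard-library modules. The source data, field names, and transformations are all hypothetical, and a real pipeline would use dedicated ETL tooling, but the shape of the process is the same:

```python
import csv
import io
import sqlite3

# Hypothetical source: a CSV export with an inconsistent date format
# and a duplicate record. All values here are illustrative.
raw_csv = """id,Name,signup_date
1,Ada,2023-01-05
2,Grace,2023/01/06
2,Grace,2023/01/06
"""

# Extract: pull the records out of the source.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: normalize the date format to fit the warehouse schema
# and collapse duplicate records (the set removes exact duplicates).
def transform(row):
    return (int(row["id"]), row["Name"], row["signup_date"].replace("/", "-"))

clean = sorted({transform(r) for r in rows})
print(clean)  # [(1, 'Ada', '2023-01-05'), (2, 'Grace', '2023-01-06')]

# Load: insert the cleaned rows into the (in-memory) warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE signups (id INTEGER, name TEXT, signup_date TEXT)")
warehouse.executemany("INSERT INTO signups VALUES (?, ?, ?)", clean)
warehouse.commit()
```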

After data are effectively collected and cleaned, they can be analyzed with a variety of techniques. Analysis often begins with descriptive and exploratory data analysis. Descriptive data analysis uses statistics to organize and summarize data, making it easier to understand the broad qualities of the dataset. Exploratory data analysis looks for insights into the data that may arise from descriptions of distribution, central tendency, or variability for a single data field. Further relationships between data may become apparent by examining two fields together. Visualizations may be employed during analysis, such as histograms (graphs in which the length of a bar indicates a quantity) or stem-and-leaf plots (which divide data into buckets, or “stems,” with individual data points serving as “leaves” on the stem).
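A brief sketch with Python’s standard statistics module shows what descriptive and exploratory analysis might look like on a single field; the dataset here is made up for illustration:

```python
import statistics
from collections import Counter

# Illustrative dataset: daily order counts for a small shop (invented numbers).
orders = [12, 15, 11, 14, 30, 13, 12, 15, 14, 13]

# Descriptive statistics: central tendency and variability.
print("mean:  ", statistics.mean(orders))    # 14.9
print("median:", statistics.median(orders))  # 13.5
print("stdev: ", round(statistics.stdev(orders), 2))

# A text histogram: the length of each bar indicates how often
# a value occurs in the data.
for value, count in sorted(Counter(orders).items()):
    print(f"{value:3d} | {'#' * count}")
```

Even this small example shows exploratory analysis at work: the mean (14.9) sits well above the median (13.5) because a single outlying value (30) pulls it upward, a pattern the histogram makes visible at a glance.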

big data, in technology, a term for large datasets. The term originated in the mid-1990s and was likely coined by John Mashey, who was chief scientist at the American workstation manufacturer SGI (Silicon Graphics, Inc.). Big data is traditionally characterized by the “three V’s”: volume, velocity, and variety. Volume naturally refers to the large size of such datasets; velocity refers to the speed with which such data are produced and analyzed; and variety refers to the many different types of data, which can be in text, audio, video, or other forms. (Two further V’s are sometimes added: value, referring to the usefulness of the data; and veracity, referring to the data’s truthfulness.) Since the term big data was coined, the amount of data has grown exponentially. In 1999 an estimated 1.5 exabytes (1 exabyte = 1 billion gigabytes) of data were produced worldwide; in 2020 that number grew to an estimated 64 zettabytes (1 zettabyte = 1,000 exabytes). About the turn of the 21st century, big data referred to datasets of a few hundred gigabytes each; in 2021 ESnet, the U.S. Department of Energy’s data-sharing network, carried more than 1 exabyte of data.

In the 2020s nearly every industry uses big data. Entertainment companies, particularly streaming companies, use the data generated by consumers to determine which song or video a given consumer may want to see next or even to determine what kind of movie or television series the companies should produce. Banks rely on big data to find patterns that could indicate fraud or persons who may be a credit risk. Manufacturers use big data to detect faults in the production process and to avoid costly shutdowns by finding the best time for equipment maintenance.