| |
Data Warehousing

CoSORT Performs, Consolidates, and Speeds Data Integration and Staging Tasks
CoSORT's parallel sort engine and sort control language (sortcl) program do the heavy lifting in the world's largest data warehouses and e-commerce data webhouses. Designed for flat file extract - transform - load (ETL) work, they combine to rapidly perform several critical functions -- all in a single pass -- including:
sortcl's familiar data definition language (DDL) uses symbolic field-names and centralized data dictionaries to reduce script sizes and creation time, and to facilitate SQL table creation and loader and file compare metadata. Job scripts are data manipulation language (DML) commands that reference those data layouts. sortcl jobs get run from the command line, in batch scripts, from programs, and via a unique cross-platform Java GUI. sortcl and complimenting facilities in CoSORT can convert, and drop-in to replace, many third party data layouts and mainframe-based sorts. sortcl experts worldwide provide conversion and implementation assistance.
Extracting (Selection)
 |
|
Most operational data in commercial and government enterprises resides internally in sequential flat files, mainframe (relational) database tables, or is imported from data tapes and transmissions generated externally. These historical customer, inventory, and order processing databases are optimized for ad hoc queries and transactions, not extraction. A large-scale table or flat file data "dump" from these bases is the best way to begin. sortcl accepts multiple input files, or records streaming through standard in (pipes) in order to perform inner and outer joins, plus vertical (conditional) selection, on records for these processes:
|
Cleansing (Filtering)
 |
|
Beyond conditional include or omit criteria, additional record filtering functions can be used to "horizontally" select virtual records for sorting, reformatting, translation, aggregation and output reporting. sortcl's data cleaning -- though not as sophisticated as fuzzy logic tools from Vality or idCentric, for example -- includes conditional or unconditional elimination, reduction, or writing to an error file of duplicate records, headers, fields, and bytes. "Scrubbing" data increases the efficiency of all downstream warehousing processes, including those left in sortcl:
|
Sorting (Reordering)
 |
|
CoSORT's parallel coroutine sort engine distributes the data across multiple CPUs to provide the fastest possible reordering of the data based on (any number of) specified (fixed and/or floating) key fields and collating sequences. The results are merged and prepared for cross-table joins and mapping:
|
Matching (Joins)
 |
|
CoSORT is the only sort/ETL product on the market with a single-pass join. sortcl can match two sorted files at I/O speed, using SQL join syntax, to produce a joined output based on user-specified conditions. Performing a join within a database can take up to 20 times longer than the same join done in sortcl.
|
Remapping (Formatting)
 |
|
While fixed or variable position fields are mapped from input to output, their data can be relocated (repositioned), resized, mapped, and type-converted (see below). Header records can be parsed, stripped or rewritten on output. Special formatting characters and environment variables, -- including markup language commands for web reporting -- can be inserted. In addition, mathematical expressions (cross-calculation) between field data can be performed to derive new values on output.
|
Converting (Translation)
 |
|
While fixed or variable position fields are mapped from input to output, their data can be relocated, resized, and converted by type; e.g. from EBCDIC to ASCII, or mixed packed decimal to signed and zoned decimal, for example. Many mainframe binary forms are undesirable for data propagation, mining and access tools on open systems (which is why sortcl processes CSV).
|
Grouping (Aggregation)
 |
|
CoSORT's sortcl can count, summarize (total), average, and find the maximum and minimum values based on inter- and intra-record break conditions to produce sophisticated EIS summary, or "drill-down" analysis on multiple classes of field data. Also uniquely featured are running (accumulating) aggregates, and aggregates on cross-calculated values. These grouping features are also used for ad hoc MIS reports and complex trend analysis.
|
Loading (Sequencing)
 |
|
Selecting, sorting, joining, reformatting, and aggregating data not only prepare it for database repopulation qualitatively, but quantitatively as well -- the amount of data back in drops. 100 million rows can be reduced to 10 million, which, along with sorting, vastly improves the speed and efficiency of Oracle, load tools plus Sybase bcp and Red Brick's PTMU. Sorted output can be piped in index order (with sequence numbers affixed) directly into the loader. For IBM UDB users, the CoSORT Load Accelerator (CLA) for DB2 directly replaces the sort within DB2's loader on Unix.
|
|