Serious Data Compression for Network Transport in MultiValue Systems

Each character of each record and data element in MultiValue systems is made up of eight bit ASCII characters, ranging from 0-255. Most common data sets use only portions of the ASCII table, in relatively consistent patterns. Much data is purely alphabetic, some is alphanumeric, some is numeric, and some is Boolean. Often, complete records fall completely within these subsets of the ASCII table. Most of these subsets can be represented by less than eight bits. This presents an opportunity for serious data compression, especially for network transport.

Cascading Evaluation of Compression Type

The first step in my compression algorithm is the cascading evaluation of the compression type. I resolve the data into the following types:

Boolean data
Numeric data
Alphabetic data

All one case
Mixed case

Alphanumeric data

All one case
Mixed case

I use the CONVERT function repeatedly to remove data from the record to be compressed. To determine if data is Boolean, for instance, I remove all Boolean and all system-delimiters from the data. If the remaining data is null, the original data is Boolean and can be compressed according to the Boolean algorithm (which is the best compression, down to three bits). If the remaining data is not null, I continue testing the data until the best compression algorithm for that data is found. There is overhead associated with determining the compression type.

*
*/ test to see if it is 3 bit data (booleans)
*
TESTDTA=INDTA
CONVERT '01YN.' TO '' IN TESTDTA
CONVERT SEPS TO '' IN TESTDTA
IF TESTDTA='' THEN
SEMA='B'
CALL COMPRESS.3(INDTA,OUTDTA)
INS SEMA BEFORE OUTDTA<1>
RETURN
END

There is also performance overhead in compressing and expanding the data. In MultiValue Basic, since each byte must be expanded into a set of bytes representing the bits in the byte, rather than being able to work at the real bit level, there is some overhead involved in this expansion of the data. Some versions of MultiValue have a system-level function, usually called 'MB' and used with ICONV and OCONV, which can perform this. I provide Basic subroutines which performs these same functions (called ICONV.MB/OCONV.MB).

SUBROUTINE ICONV.MB(BINARY,SQ)
*** converts binary string to ascii
*** sequence
TEST=BINARY
CONVERT '1' TO '0' IN TEST
IF TEST='00000000' ELSE SQ=0; RETURN
IF BINARY[1,1]=1 THEN 
SSMM=128
END ELSE SSMM=0
IF BINARY[2,1]=1 THEN
SSMM+=64
END
IF BINARY[3,1]=1 THEN
SSMM+=32
END
IF BINARY[4,1]=1 THEN
SSMM+=16
END
IF BINARY[5,1]=1 THEN
SSMM+=8
END
IF BINARY[6,1]=1 THEN
SSMM+=4
END
IF BINARY[7,1]=1 THEN
SSMM+=2
END
IF BINARY[8,1]=1 THEN
SSMM+=1
END
SQ=SSMM
RETURN

There are several considerations you might have regarding using my Basic subroutine versus the system-level one. I have little doubt that the system-level function will work with less overhead than my Basic one, though mine is written tightly, if only because the system-level one is written in C or assembler. However, other issues may be more important to you. If your software has to run on multiple database platforms, issues such as portability, maintenance, coding simplicity, and coding standards could affect your decision. Using my Basic subroutine for this function will allow you to have one version of your software which runs on all platforms. Alternately, you could build in a CASE statement or use compiler INCLUDEs to implement platform-specific usage of these functions. You could even take it one step further and ask your database vendor(s) to implement the MB conversion in ICONV/OCONV in their next releases. If you want to take advantage of performance advantages wherever you can, you could use this second CASE/INCLUDE approach. If you are satisfied with the performance of my subroutine, however, I suggest you use it all of the time and simplify the coding and maintenance of your source code.

Deciding Whether to Use Data Compression

Another consideration is whether to use compression at all. Several factors should be assessed when deciding this:

Consideration of length of transport

LAN
WAN
Internet
Middleware
Message Broker/ Message Bus architecture

Consideration of volume of data
Consideration of usual type of data
Note: two bytes of overhead for each record compressed. This could be cut to one if MultiValue vendors supported A=B[2,0] to yield the string from the second byte to the end of the string, no matter how long

The further the data has to be transported, the more processor time (both Central/CPU and other) will be used throughout its journey. It follows then that using data compression becomes more important when working over the Internet and over Wide Area Networks, and also when other intermediate software such as middleware or message brokers are involved. This is not to say that data compression cannot be advantageous over Local Area Networks; it might.

The issue of the volume of data can be two-fold. Data compression can be very advantageous when the volume of data is large. However, if the data is very heterogeneous and comprises bytes throughout the ASCII set, the likelihood of high rates of compression falls with my algorithm. If, however, most of the data will compress to seven or eight bits or less most of the time, regardless of volume, my compression algorithm could yield impressive results in most situations. Again, other factors such as the length of transport can magnify these effects.

As implied before, the type of the data is very important when assessing the benefits of using my data compression. Because the most advantageous, highest compression types are tested for first, this means that the better the type of data (Boolean, then numeric, then alphabetic, then alphanumeric), the better the compression and also the faster the determination of which algorithm to use is made. The compression of Boolean data will begin earlier than the compression of other types of data and will result in the highest rate of compression, for instance (fig. 1).

SUBROUTINE COMPRESS.3(INDTA,OUTDTA)
***
compresses data from 8 bits down to 3, when appropriate
* booleans, separators, period
 OUTDTA=''; BUFFER=''
  LGT=LEN(INDTA)
  FOR AAA=1 TO LGT
    CHR=INDTA[AAA,1]
    SQ=SEQ(CHR)
    BEGIN CASE
    CASE SQ=48 OR SQ=49; SSQQ=SQ-48
    CASE SQ=89; SSQQ=2
    CASE SQ=78; SSQQ=3
    CASE SQ=46; SSQQ=4
    CASE SQ>=252 AND SQ<=254; SSQQ=SQ-247
    CASE 1; SSQQ=''
    END CASE
    IF SSQQ#'' THEN
      CALL OCONV.MB(SSQQ,BINARY)
      BUFFER:=BINARY[6,3]
    END
    IF LEN(BUFFER)>3 THEN
      N.BINARY=BUFFER[1,8]
      CALL ICONV.MB(N.BINARY,N.SSQQ)
      OUTDTA:=CHAR(N.SSQQ)
      BUFFER=BUFFER[9,999999]
    END
  NEXT AAA
  IF BUFFER#'' THEN
    N.BINARY=BUFFER'L%8'
    CALL ICONV.MB(N.BINARY,N.SSQQ)
    OUTDTA:=CHAR(N.SSQQ)
  END
  RETURN

Fig. 1

Very small strings of data may be unwise to compress, however, because it takes two bytes of data to describe the type of data compression and this would add to the total amount of data being transported. Personally, I would like to see database vendors support string functions of the type A=B[2,0], which would yield from the second character in B to the end of B, regardless of the length of B (a function necessary to lower this two-byte overhead to one). This really is not possible to do in a reliable way in any other function (including FIELD(B,X,1,999999)) for very long strings. This would allow me to cut this overhead to one byte. If it were only for my compression routine, I probably not mention this, but I have felt the need for this for some time in many programming tasks. A=B[2,LEN(B)] is actually two function calls and with long strings, significantly adds overhead to the algorithm.

Issues of Data Storage and Network Transport

Segment Marks (ASCII Character 255) are produced during compression and are usually a part of the derived compressed data. This is an unavoidable result of the algorithm but is acceptable to me because the primary purpose of this algorithm is network transport, not data storage. That said, however, some systems may allow data to be stored when it contains Segment Marks. In my experience, however, few, if any, MultiValue hashed files will handle Segment Marks as part of the data and will either produce error messages, abort, or produce group format errors when such data is saved or retrieved. It may be possible to save this data to operating system level files at the NT, Unix, Linux, etc., level, but that you would attempt at your own risk; I have not tested it. I have posed the question through the years to some database vendors whether delimiting records (using CHAR(255)) in hashed file groups is really necessary and whether it could be accomplished by just working with the length of the data records as shown in the record key header, but I don't think this has ever changed.

Segment Marks are not usually an issue in network transport. I provide here both MultiValue-side Basic subroutines and Visual Basic-side subroutines. Nathan Rector was kind enough and skilled enough to translate my algorithm into Visual Basic to supply those subroutines. Thank you again, Nathan.

Licensing and Liability of These Subroutines

These subroutines are a gift to the MultiValue community. They are absolutely free and not subject to any restriction, GPL, or other license. You are free to copy these and even to rebrand them and re-market them, if you see fit, understanding, of course, that I retain the right to distribute my versions free of charge to anyone who wants them and that they will be downloadable from my http://www.racent.net home page at any time. While these are being given to promote MultiValue business, of course, I hope to bring people to my web site and promote my own business as well.

These subroutines have been tested, but not thoroughly. The algorithm involved, however, is not as difficult as it might at first appear and we expect to be able to resolve any problems encountered. Any feedback regarding bugs found (and help resolving them) will be appreciated and fixes will be implemented in the code offered in the download on my website, however, I expressly absolve both Nathan Rector and myself from any and all liability regarding the use of these routines.

To the best of my knowledge and Nathan's, and after seeking such compression routines in a fairly thorough manner before I decided to develop my own, these are the only compression subroutines of this type in the MultiValue marketplace. The others I found all seemed to be some variant of RLE or were RLE-based. These are not. For data which is found in database files/tables, we believe this type of compression to be superior to RLE-based algorithms, which may be better suited to image files.

John Racine

View more articles

Featured:

May/Jun 2011