22 Data Storage & Compaction in Geographic Information System

Dr Seema Mehra Parihar

Learning Objectives:

In this chapter you will learn about the following:

Understand the relevance of data compaction techniques.
Learn about four data compaction techniques in raster GIS.
Recognize the spaghetti and topological data models in Vector GIS
Appreciate the need for data compaction techniques

Introduction

In your previous chapter on data models you have understood the relevance and importance of raster and vector data models. You also have understood that “The object based spatial database (those obtained by field surveying, remote sensing image analysis, photo interpretation, and digitization etc.) are generally represented in the form of coordinate lines and termed as Vector data models” (Parihar,2017). On the other hand, “when the spatial database is structured on the field-based model the basic spatial units are different forms of tessellation (regular as DEM or irregular as TIN) are termed as Raster data model” (Parihar,2017). The storage and manipulation of various location based data sets and related attribute information in both the models is very relevant and consequently have received lot of attention too and that is the premise of this chapter on data storage and compaction techniques in raster and geographic Information system (GIS). We realize by now that every data to be considered in GIS environment is to be treated as unique and has its own identity with a requirement of a storage and manipulation space. The major concern that arise is when complex entities such as more than two polygons are stored and the adjoining boundaries are entered twice, causing duplication of adjacent line and also sometimes generating in matching problems. In addition, it occupies more space in the computer. To understand this complexity and related concerns this chapter is divided into two sections, one related to the Data Compaction Techniques in Raster GIS and Second related to data storage Topology driven models in Vector GIS .

1. Data Compaction Techniques in Raster GIS

Data compaction or compression is common in GIS and is based on different algorithms that reduce the size of a computer file, but maintains all the information intact. Compression algorithms may be “lossless” (where no information is lost) or “lossy” (where some information is lost).Lossy algorithm is generally not applied to thematic data , but largely applied when image data is considered. Mostly they are applied to discrete raster data algorithms. Consequently, Data compression/compaction in a grid based raster GIS environment has been studied and researched upon by many. Many compaction techniques in raster GIS have been evolved in different years and for our purpose the techniques considered here are the following :

Chain coding,

Run length coding, Block coding and Quad trees.

1.1 .Chain Coding

Chain coding is largely undertaken as a clockwise coding method and is generally referred as Freeman Chain Coding. To begin with, from the following figure1 locate the 28 cells required for storing the given entity.

Figure 1: Locating Grid Cells for Chain Coding in Raster GIS

The above technique is simple and straightforward. “In a chain-coded representation of a map, using any starting point on the border of an object, the sequence of cardinal directions of the cells that make up the boundary of the object are recorded systematically in a clockwise direction. The polygon is defined in terms of unit cells measured in cardinal directions” Freeman. The steps are as follows :

i. Identification of points by a number ranging from 0 to 7.

For example, East may be identified as 0, North as 1, West as 2 and South as 3. You may select your own numbering system.

ii. Once a move in the direction of the line is made and recorded, the locational grid is re-centered over a new location and the next move defined in the same way.

The advantage of chain coding is that it enables storing raster data and above all it is very useful for detection of sharp turns and area estimation. The short fall is that there is a repetition of data because of repetition of adjoining boundaries.

1.2 Run-length codes

Run Length Code is an improvement over the conventional chain code technique and suitable to the personal computers with limited storage capacity. It stores a single value for a group of cells instead of storing the value for each individual cell. “This method exploits the fact that many datasets have large homogeneous regions. In this procedure, adjacent cells along a row that have the same value are treated as a group and termed a run. Each row in the grid (one pixel width) is examined in turn, and pixels having the same value, that is, homogeneous pixels are grouped together. It uses a 1D method of grouping pixels with similar or identical values” Heywood,2002. Let us try to understand this from the figure 2 where following steps are undertaken:

i. Locate the pixels with similar values;

ii. Position reference be noted;,

iii. Sequences of pixels with similar values are replaced in the memory by

a. the ‘positional reference’ to the first pixel in the grouping and

b. by the ‘number representing’ the number of pixels in the grouping.

‘

Figure 2: Run Length Coding

The run length coding is common and rather simple method for compression raster data. The left number in row two in Figure 2 depicts the number of cells in run (8 in this case) and the right size is cell value (6 in this case) and the value together noted as 8:6. Though a simple method , it’s major shortcoming in the words of Hey Wood is that “the pixel groups are identified only in one coming is that direction (parallel to the x-axis) as a result the nearby rows, that is, those above and below the current row which may have the same values of the group are represented separately”.

1.3 Block codes

Block codes “are a two-dimensional extension to run length codes. It assumes that the cells are recorded in the form of blocks of squares. The method uses square blocks to tile the area to be mapped” (Burroughs 1986). This method has largely been applied in performing union and intersection of regions and for detecting properties such as elongation. The steps undertaken in giving Block codes are:

i. Locating as many large square blocks as possible in the given raster image.

ii. The codes be noted through the Medial axis Transformation (MAT) from the square blocks.

iii. The blocks are arranged in the hierarchical form.

Figure 3: Block Code , Based on Heywood,2002

According to Heywood, 2002, “A unit square represents one cell, whereas a 4 square block represents 2x2cells, a 9 square block represents 3×3 cells and so on. Each block is coded only with the location of a cell (the lower left of the block) and the side length of the block. The larger the square that may be fitted into a region, the more efficient block coding becomes”.

1.4. Quadtree

The fourth raster compression method is Quad tree method. It is a variable spatial resolution model sometimes equated with hierarchical tessellation model. “A quadtree is a term used to describe a hierarchical or tree-based data structure whose common property is that the structure is based on the principle of the recursive decomposition of space to represent both vector and raster data” Burrough,1987 . It is similar to run length coding and are largely used while compressing area features. They generally represent raster data structures with variable spatial resolution. This can be best understood from figure 4:

Variable spatial resolutions to be noted;
Each specific area feature to be delineated;
With each area feature the raster cell sizes to be combined and adjusted;
Codes are assigned first to the larger raster cells that fit into one uniform area;
Successively smaller cells at each iteration are halved (cell dimension), until the smallest cell size is reached.

Figure 4: data Compression in Quad tree Method

One of the “advantages of the raster data model is that each cell can be subdivided into smaller cells of the same shape and orientation and the Quadtree model addresses both the resolution and the redundancy issues directly” (Pequot, 1990). Through this method ,operations such as point-in-polygon searches can be performed. However, Burroughs (1984) is of the view that “the largest problems associated with quadtrees is that tree translation is not translation-invariant — two regions of the same shape and size may have quite different quadtree, so consequently shape analysis and pattern recognition are not straight forward”. Another disadvantage is that it is time consuming to create a quadtree data structure. This implies that a re-building of the entire quadtree is required , just by a mere change in the original data set.

2. Data Storage in Vector Data

We know by now that the vector data model represents anything from a simple number to a complex entity. It is an object based approach to the representation of the real world features and is best used to represent discrete objects. It is certainly straight forward when only simple polygon is represented, but with the introduction of complex entities such as more than two polygons, the adjoining boundaries will cause duplication of adjacent line and so need more space in the computer. Moreover, there are also many matching problems that occur while overlapping two sets of same adjoining boundary . To overcome this and the related problems, two models are considered here:

2.1 Spagehetti Model

2.2 Topology Model

2.1 Spaghetti Model

Vector data that have been collected but not structured are said to be spaghetti data model like a plate of cooked spaghetti with no ends connected and no intersections affecting the plate. It is also called as non-topological or geometric or path topological model. It was originally developed to organize, manipulate and store line data. The spagehetti model enters each line separately through:

Storing the starting node
Storing the end node
Storing the vertices to note the change in direction or path
Not storing /recognizing when the two lines meet or cross each other.

Figure 5: Spaghetti Model with six lines

As figure 5 depicts that the spaghetti model records each of the six line depicted in the figure separately. What is to be noted is that connections are note recognized and nor is the crossing over or meeting points or vertices. It looks like a map but without underlying structure. Graphical elements are stored and not the Graphical entities in the spaghetti model. “The spatial relationships between the features are not retained. In other words it is a collection of points and line segments with no real connection. All the information necessary to draw a map is in place, but is randomly organized (“unlinked”). This makes organization of the data easy, but makes it very difficult to use for analysis”.

The spagehetti model does not retain any specific relation.

The positive aspect of the spaghetti model is that it is relatively efficient as a method of cartographic display and used in Computer-Assisted Cartography (CAD) where analysis is not the primary purpose.

Point Dictionary Model

A marginal improvement over the path topological models (spaghetti model) is the point dictionary model. In this model, both the transformation between numerical coordinates and points is incorporated.

i. All coordinate pairs are numbered sequentially ;

ii. All coordinate pairs are stored in random access allocation.

iii. List of Point ID prepared as the address for accessing the appropriate coordinate.

iv. Each polygon is stored as a circular list of the point ID

Figure 6: Point Dictionary Model

Therefore, as depicted in figure 6 , the storage requirements are slightly reduced as x, y pairs are stored only once. However each point ID is still stored twice for common boundaries. “Though, this model saved space in the system but the adjacency problem remained. Thus the point ID values are first retrieved from the polygon list, which in turn are used to retrieve the respective coordinates”.

Chain Dictionary Model

Chain Dictionary model is used to reduce and even overcome most of the time ,the adjancy problem. The retrieval of area feature or polygon is a three-step procedure in the chain dictionary model, including:

i.Retrieval of polygon ID,

i. Retrieval of point ID and

ii. Retrieval of corresponding coordinates.

Figure 7: Chain Dictionary Model

In this model the interconnections are between polygons-chains-points where each polygon/area is given a circular list of chains and they are predefined. Each chain is made of number of points and these points are with unique ids and points are noted as a list. Certainly, this model has reduced the space requirement and the access time.

2. Topology model

Unlike Spaghetti model topological model explicitly build relationships. Topology is the term used to describe the making of spatial relationships. In fact, topology adds ‘intelligence’ to the database in a Geospatial environment by explicitly building linkages and evolving relationships. “Topology is one of the most useful relationships maintained in many spatial databases. It is defined as the mathematics of connectivity or adjacency of points or lines that determines spatial relationships in a GIS. The topological data structure logically determines exactly how and where points and lines connect on a map by means of nodes (topological junctions)” Burrough,1987.

In the process of topology building (relationship), points, lines and areas are calculated and encoded. Topological data define the logical connection between points, lines and areas for geographical description and analysis. Connections between spatial objects, for example, information on areas, which bound a line segment, are considered to be topological data. From this adjacent spatial objects may also be identified. The topology of any line would thus include the starting node, its destination node and the left and right polygons through which the line passes.

Figure 8: (A) Topological Model & (B) Topological Warped Model

The topological structure permits as shown in figure 8 , encoding of the geometry of the data with no redundancy. The report (2009) notes, “The geometry of the data is encoded with little or no redundancy in arc-node structure. The database can also include attribute data for each node, arc and polygon. These are expanded in the attribute table, which explicitly links it to the geometry of the spatial object. As for example, the three basic topological relationships in ARC/INFO are as :

i. Connectivity (arcs connect to each other at nodes and provides information about linkages among spatial features.)

“Arc-node topology, was developed several decades ago as a convenient way to store information of this sort. It is used to encode information used in the US Bureau of Census TIGER boundary files and is the basis of the spatial modeling system used by the Arc/Info software system”.

ii. Area definition (Arcs that connect to surround an area define a polygon)

“Polygons are defined by a series of x, y coordinates that connect to enclose an area. Systems like GeoMedia store polygons in this format. But Arc/Info stores the arcs defining the polygon. A list of arcs that make up each polygon is stored and used to construct the polygon as and when required”

iii. Contiguity (arcs have direction and left and right polygon)”

“Every arc has direction (from node and to node). This direction has been maintained in the list of polygons as on the left and right sides of each arc. Thus any polygons sharing a common arc are adjacent”.

The five “axioms in topology are as follows:

All arcs end in points or nodes.

Arcs cannot intersect except at their nodes Areas are completely enclosed by arcs.

Areas do not overlap

Every location is within some area” GIS Fundamentals,p.3.

Figure 9: Non planar and Planar topology in Lines and polygons

Now let us understand the Planar topology (figure 9). All features in planar topology occur in a two dimensional surface. There can be no overlap among lines or polygons in the same layer.These lines are non planar because all lie in the same plane and therefore line crosses over and under line segment. The lines would intersect at node. Similarly in the case of polygons in a planar surface if and when overlap , they need to be resolved as placed one above the other with nodes located at the intersection of the boundaries. If we look closely at figure 9 we will find the presence of three polygons.

In other words, topology defines connections between features, identifies adjacent polygons and can define one feature, such as an area as a set of lines. The rule of thumb for topological data structures is that anything of interest on a map must be explicitly defined as a point, line or area in order for systems to perform any sort of spatial analysis on the data. “The geometric characteristics of a spatial entity may be described in terms of its two-dimensional shape, its distance between like or different entities, how it is connected to other entities and what entities occupy the space adjacent to it”. Such characteristics may be easily described either in words or by a system of coordinate grids on a map. Spatial entities can be described in terms of their dimensional characteristics. If these entities were to be changed by some kind of a transformation, all the characteristics of these spatial entities would likewise change. Shape can be made smaller, lines made longer and so on. In other words, there is a fundamental change to the entity itself. If a map is stretched and distorted, some of its properties may change, for example, distance, direction (angles) and relative location of objects. However, other properties such as ‘next to’, ‘is contained in’ and ‘crosses’ remain unchanged. Therefore, a strict topological property is one that remains unchanged by geometric distortions or transformations of the surface. In topology the neighborhood function and adjacency properties are important features. They are important for performing spatial analyses because we need to know the position of the feature both in absolute space and with respect to its neighboring features. Many of the methods of solving mathematical and geometric relationships work better if we know which areas share common boundaries. Some systems store boundaries as several individual line segments and include arc attributes (or pointers), which indicate which polygon falls on each side of the line segment. By storing common boundaries instead of complete polygon boundaries, duplication in digitizing is avoided, as is the problem where two versions of each common boundary do not coincide.

BOX 1

What is mathematical topology?

“Topology is the branch of mathematics, based on graph theory, which deals with geometric properties and remains unchanged under certain transformation such as bending or stretching. The characteristic of this model is that it explicitly records adjacency information among spatial entities viz. points, lines and polygons. Mathematical topology assumes that geographic features occur on a two-dimensional plane. Through planar enforcement, spatial features can be represented through nodes (0-dimensional cells); edges, sometimes called arcs (one-dimensional cells); or polygons (two-dimensional cells). Because features can exist only on a plane, lines that cross are broken into separate lines that terminate at nodes representing intersections rather than simple vertices. GIS with the aid of topology not only have the power to record location and simple attribute information but also can examine spatial relationships based upon location, as well as functional and logical relationships among geographic features” Notch,1967.

The order of connectivity defines the shape of an arc or polygon. The computer stores this information in various tables of the database structure. By storing information in a logical and ordered relationship missing information, e.g., a line segment of a polygon is readily apparent. A GIS manipulates, analyzes, and uses topological data in determining data relationships.

Summary

Data compaction or compression is common in GIS and is based on different algorithms that reduce the size of a computer file, but maintains all the information intact.
There are generally two types of Compression algorithms where one may be “lossless” (where no information is lost) or “lossy” (where some information is lost).
All four types of Data compression techniques in Raster GIS are relevant and they include the methods called as Chain coding; Run length coding; Block coding and Quad trees. Their limitations should be studied keenly before using them.
Vector data that have been collected but not structured are said to be spaghetti data model and also called as non-topological or geometric or path topological model. It was originally developed to organize, manipulate and store line data
Unlike Spaghetti model topological model explicitly build relationships. Topology is the term used to describe the making of spatial relationships. In fact, topology adds ‘intelligence’ to the database in a Geospatial environment by explicitly building linkages and evolving relationships between each point, line , polygon and attribute information.

Questions:

Classify the compaction techniques used in Raster and Vector data models.
Comment on the role of topological structures in GIS.
Illustrate with three examples from the real world where the statement ,‘Lossy algorithm is largely applied to discrete raster data algorithms’ holds true.
Answer the following questions based on the diagrams given below:

a. Name the compaction technique each figure represent .

b. Which of the following figure looks like a map but without underlying structure?

c. Count the number of lines and number of nodes that each figure represents .

Figure A

Figure B

Figure C

From the following figure answer the number of blocks that each layer represents.

Figure D

i. Building layer

ii. Elevation layer

iii. Roads Layer

iv. Vegetation Layer

you can view video on https://youtu.be/vvOlDQh390M

References

Chang, Kang-tsung (2002) Introduction to Geographic Information Systems, University of Idaho, Tata McGraw-Hill Publishing Company Ltd, New Delhi, 2002, ISBN 0-07-049552-1.
Ganesh, A. Ed. (2006) Applications of Geospatial Technology: Serial Publishing House
Goodchild, M.F. (1997) Geographical Data Modeling, Computers and Geosciences, 18:400-408.
Hohl Pat Ed. (1998) GIS Data Conversion, Strategies, Techniques, Management, Onward Press, 1998
Korte,G,B., (1997) The GIS Book, Understanding the Value and Implementation of Geographic Information Systems, 4th Edition,. Onward Press, 1997.
Parihar, S.M. (2007),, Standardisation in Geo ICT Arena, Vol.5. Issue 5.p.p 40-46. Feb,2007,Geospatial Today, p.p 40-46.
Parihar, S.M.,(2007) Zeroing on Data Errors, Vol.4 . Issue 6. March,2006 ,Geospatial Today, pp 37-42
Parihar, S.M.,(2006) 3-D GIS: Much Awaited Technology, Vol.4 . Issue 11. August,2006 ,Geospatial Today, pp 43-46
Raper, Jonathan, Ed. (1989) Three Dimensional Applications in Geographic Information Systems. Philadelphia, PA: Taylor & Francis, Inc.
Siddiqui, M.A. (2011) Concepts and Techniques of Geoinformatics, Allahabad: Sharda Pustak Bhawan
Worboys, M., & M. Duckham (2004) GIS A computing Perspective, London, CRC Press