Sheet data management

How can cell data be managed in memory efficiently?

Paging

Using a plain two-dimensional array to manage the cell data takes a lot of memory, even when no elements have been initialized. One idea is to allocate memory only for the used regions that have elements attached, and to reduce memory usage by splitting one large array into many small arrays: that is paging.

An illustration of paging:
d02.png

Before accessing a cell, its page must be found first. For a one-dimensional array, the page index and cell index are calculated as:
int pageIndex = index / pageSize;
int cellIndex = index % pageSize;

If the page size is a power of 2, it is more efficient to use bit operations instead of division.
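To show the equivalence, here is a small sketch (in Java rather than the C# used elsewhere on this page) comparing the division/modulo version with the shift/mask version, assuming a page size of 128:

```java
public class PageIndex {
    // Assumed page size of 128 = 2^7; SHIFT and MASK are derived from it
    static final int PAGE_SIZE = 128;
    static final int SHIFT = 7;              // log2(PAGE_SIZE)
    static final int MASK = PAGE_SIZE - 1;   // low 7 bits set

    public static void main(String[] args) {
        int index = 1000;

        // Division / modulo version
        int pageIndex = index / PAGE_SIZE;   // 7
        int cellIndex = index % PAGE_SIZE;   // 104

        // Equivalent bit operations, valid only because PAGE_SIZE is a power of 2
        int pageIndex2 = index >> SHIFT;
        int cellIndex2 = index & MASK;

        System.out.println(pageIndex2 + " " + cellIndex2);
    }
}
```

For non-negative indexes the two versions always produce the same result, and the shift/mask form avoids the comparatively slow integer division.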

Tree

By combining several pages and linking them under a larger parent index page, memory usage can be decreased further.
d01.png

All cell data is attached to the bottom-most index pages, for example:
46.png

Index pages that have never been accessed, along with their cells, stay null and consume no memory.
47.png

To access a cell, the cell index is calculated once per tree level; for a tree of depth 2 the calculation is performed two times:
48.png
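For a depth-2 tree with a row page size of 128, the two index calculations for one row coordinate might look like the following sketch (illustrative names, not ReoGrid code):

```java
public class TwoLevelIndex {
    static final int ROW_BIT_LEN = 7;                 // page size 128 = 2^7
    static final int ROW_MASK = (1 << ROW_BIT_LEN) - 1;

    public static void main(String[] args) {
        int row = 300;

        // First calculation: which child of the root index page
        int topIndex = row >> ROW_BIT_LEN;            // 300 / 128 = 2

        // Second calculation: which slot inside that leaf page
        int leafIndex = row & ROW_MASK;               // 300 % 128 = 44

        // The original coordinate can be rebuilt from the two indexes
        System.out.println(topIndex * (ROW_MASK + 1) + leafIndex);
    }
}
```

A deeper tree simply repeats this shift-and-mask step once per level, which is why access time grows with tree depth.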

A smaller page size takes less memory, but if the pages are too small, a deeper tree must be created to reach the same array capacity, and a deeper tree costs more time on every cell access. Finding a good balance between page size and tree depth is therefore the most important consideration. Currently ReoGrid uses the following values:
  • Number of rows: 128
  • Number of columns: 32
  • Depth of tree: 3
So the physical maximum supported capacity of array is:
  • Number of rows: 128^3 = 2,097,152
  • Number of columns: 32^3 = 32,768
  • (Cell elements up to 2,097,152 x 32,768 = 68,719,476,736)
  • (Maximum memory usage is 68,719,476,736 * 1.5 * 8 / 1024 / 1024 / 1024 = 768GB)
Although the row capacity could be 2,097,152, the row limit is set to 1,048,576 in order to fit the data structure of the row headers.
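The capacity figures above can be verified in a few lines; the 1.5 overhead factor and 8 bytes per element are the estimates used in the calculation above:

```java
public class Capacity {
    public static void main(String[] args) {
        long rowCap = 128L * 128 * 128;   // page size 128, tree depth 3
        long colCap = 32L * 32 * 32;      // page size 32, tree depth 3
        long cells  = rowCap * colCap;

        // Estimated worst-case memory: 1.5 overhead factor, 8 bytes per element
        double gb = cells * 1.5 * 8 / 1024 / 1024 / 1024;

        System.out.println(rowCap);       // 2,097,152
        System.out.println(colCap);       // 32,768
        System.out.println(cells);        // 68,719,476,736
        System.out.println(gb);           // 768.0
    }
}
```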

Code implementation

In ReoGrid, the getter and setter that calculate the index and read or write elements of the paged two-dimensional array look like this:
get
{
	Node node = root;

	// MaxDepth is 3
	for (int d = MaxDepth - 1; d > 0; d--)                 
	{
		// RowBitLen is 7 when row page size is 128
		int r = (row >> (RowBitLen * d)) % RowSize;    
		// ColBitLen is 5 when col page size is 32
		int c = (col >> (ColBitLen * d)) % ColSize;

		node = node.nodes[r, c];

		if (node == null) return default(T);
	}

	// RowSize is 128, and ColSize is 32
	return node.data[row % RowSize, col % ColSize];        
}

Setter:
set
{
	Node node = root;

	for (int d = MaxDepth - 1; d > 0; d--) {
		// RowBitLen is 7 when row page size is 128
		int r = (row >> (RowBitLen * d)) % RowSize;
		// ColBitLen is 5 when col page size is 32
		int c = (col >> (ColBitLen * d)) % ColSize;

		Node child = node.nodes[r, c];

		if (child == null) {
			if (value == null) {
				return;
			} else {
				child = node.nodes[r, c] = new Node();
				if (d > 1) { 
					child.nodes = new Node[RowSize, ColSize];
				}
			}
		}

		node = child;
	}

	if (node.data == null) {
		if (value == null) {
			return;
		} else {
			node.data = new T[RowSize, ColSize];
		}
	}

	node.data[row % RowSize, col % ColSize] = value;
}
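To see the getter and setter working end to end, here is a minimal Java port of the same structure (depth 3, page sizes 128×32 as in ReoGrid; the class and field names are my own, not ReoGrid's):

```java
public class PagedGrid<T> {
    static final int MAX_DEPTH = 3;
    static final int ROW_BIT_LEN = 7, COL_BIT_LEN = 5;   // pages of 128 and 32
    static final int ROW_SIZE = 1 << ROW_BIT_LEN, COL_SIZE = 1 << COL_BIT_LEN;

    static class Node {
        Node[][] nodes;      // children, for index pages
        Object[][] data;     // cell elements, for leaf pages
    }

    final Node root = new Node();
    { root.nodes = new Node[ROW_SIZE][COL_SIZE]; }

    @SuppressWarnings("unchecked")
    public T get(int row, int col) {
        Node node = root;
        for (int d = MAX_DEPTH - 1; d > 0; d--) {
            int r = (row >> (ROW_BIT_LEN * d)) % ROW_SIZE;
            int c = (col >> (COL_BIT_LEN * d)) % COL_SIZE;
            node = node.nodes[r][c];
            if (node == null) return null;   // page never touched
        }
        return node.data == null ? null : (T) node.data[row % ROW_SIZE][col % COL_SIZE];
    }

    public void set(int row, int col, T value) {
        Node node = root;
        for (int d = MAX_DEPTH - 1; d > 0; d--) {
            int r = (row >> (ROW_BIT_LEN * d)) % ROW_SIZE;
            int c = (col >> (COL_BIT_LEN * d)) % COL_SIZE;
            Node child = node.nodes[r][c];
            if (child == null) {
                if (value == null) return;   // clearing a cell in a missing page: nothing to do
                child = node.nodes[r][c] = new Node();
                if (d > 1) child.nodes = new Node[ROW_SIZE][COL_SIZE];
            }
            node = child;
        }
        if (node.data == null) {
            if (value == null) return;
            node.data = new Object[ROW_SIZE][COL_SIZE];
        }
        node.data[row % ROW_SIZE][col % COL_SIZE] = value;
    }

    public static void main(String[] args) {
        PagedGrid<String> grid = new PagedGrid<>();
        grid.set(1000000, 5000, "hello");    // far-apart cells allocate only
        grid.set(3, 3, "world");             // their own pages
        System.out.println(grid.get(1000000, 5000));
        System.out.println(grid.get(0, 0));  // untouched cell reads as null
    }
}
```

Setting two cells that are far apart allocates only the handful of pages on their two paths, not the whole 2,097,152 × 32,768 grid.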

What can be improved

There is a design issue, which is also a performance issue: ReoGrid doesn't know the boundary of the data. Consider what happens when the user selects an entire row or column, or when ReoGrid is asked to calculate a sum. The only option is to read the cells one by one and add them up, which takes a very long time on a huge sheet.

Solution 1: Add valid data boundary to each pages?

Adding a data boundary to each page, identifying where the valid data starts and ends, should improve iteration speed. The code would look like this:
double sum = 0;
foreach (var page in sheetData) {
  for (int r = page.minRow; r <= page.maxRow; r++) {
    for (int c = page.minCol; c <= page.maxCol; c++) {
      // cells inside the boundary may still be null
      var cell = page[r, c];
      if (cell != null) sum += cell.data;
    }
  }
}
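For this to work, each page's boundary would have to be kept up to date on every write. A sketch of that bookkeeping (field and method names are illustrative, not from ReoGrid):

```java
public class PageBounds {
    // Bounding box of cells that have ever been written in this page
    int minRow = Integer.MAX_VALUE, maxRow = -1;
    int minCol = Integer.MAX_VALUE, maxCol = -1;

    // Called from the setter whenever a cell in this page is assigned
    void expand(int r, int c) {
        if (r < minRow) minRow = r;
        if (r > maxRow) maxRow = r;
        if (c < minCol) minCol = c;
        if (c > maxCol) maxCol = c;
    }

    boolean isEmpty() { return maxRow < 0; }

    public static void main(String[] args) {
        PageBounds b = new PageBounds();
        b.expand(10, 3);
        b.expand(2, 7);
        System.out.println(b.minRow + ".." + b.maxRow);  // 2..10
        System.out.println(b.minCol + ".." + b.maxCol);  // 3..7
    }
}
```

One caveat of this scheme: deleting a cell cannot cheaply shrink the box, so after many deletions the boundary may become much larger than the remaining data.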

Solution 2: Valid row and valid column identification

Add a flag to the row headers and column headers that identifies whether the row or column contains any data. The code would look like this:
double sum = 0;

for (int r = 0; r < sheet.RowCount; r++) {
  if (sheet.IsEmptyRow(r)) continue;

  for (int c = 0; c < sheet.ColCount; c++) {
    if (sheet.IsEmptyCol(c)) continue;

    // a cell may still be null even when its row and column are both non-empty
    var cell = sheet[r, c];
    if (cell != null) sum += cell.data;
  }
}
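One way to maintain those flags is to keep a count of non-null cells per row and per column, so that clearing cells can turn a row or column back to "empty". A sketch under that assumption (names are mine, not ReoGrid's):

```java
public class UsageFlags {
    // Count of non-null cells per row and per column;
    // a row or column is "empty" when its counter is zero
    final int[] rowCount, colCount;

    UsageFlags(int rows, int cols) {
        rowCount = new int[rows];
        colCount = new int[cols];
    }

    // Called from the setter; wasNull/isNull describe the old and new value
    void onCellChanged(int r, int c, boolean wasNull, boolean isNull) {
        if (wasNull && !isNull) { rowCount[r]++; colCount[c]++; }
        else if (!wasNull && isNull) { rowCount[r]--; colCount[c]--; }
    }

    boolean isEmptyRow(int r) { return rowCount[r] == 0; }
    boolean isEmptyCol(int c) { return colCount[c] == 0; }

    public static void main(String[] args) {
        UsageFlags f = new UsageFlags(100, 100);
        f.onCellChanged(5, 9, true, false);   // write a value
        System.out.println(f.isEmptyRow(5));  // false
        f.onCellChanged(5, 9, false, true);   // clear it again
        System.out.println(f.isEmptyRow(5));  // true
    }
}
```

Counters, unlike simple booleans, stay correct when the last value in a row or column is deleted, at the cost of one array slot per row and per column.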

In fact, mixing solutions 1 and 2 would probably work even better :)

Performance Test

Saturday, January 11, 2014

Test Item | Total Cycles | Elapsed (ms) | One Cycle | Memory Usage
Sequential Read & Write (1,000 rows × 1,000 columns) | 2,000,000 | 161 | 0.0805 μs | 9.6 MB
Sequential Read & Write (100,000 rows × 100 columns) | 20,000,000 | 1624 | 0.0812 μs | 46 MB
Interval Read & Write (10,000 rows × 10,000 columns, interval 100) | 20,000 | 67 | 3.35 μs | 35 MB
Random Read & Write (rows 0-10,000, columns 0-10,000) | 20,000 | 183 | 9.15 μs | 73 MB
Iterate Over Empty Range (1,000,000 rows × 200 columns) | 200,000,000 | 152 | 0.00076 μs | -
Fill 1,000,000 Rows (1,000,000 rows × 200 columns) | 1,000,000 | 130 | 0.13 μs | 145 MB
Sum 1,000,000 Rows (Iterator) (1,000,000 rows × 1 column) | 1,000,000 | 146 | 0.146 μs | -
Sum 1,000,000 Rows (For) (1,000,000 rows × 1 column) | 1,000,000 | 67 | 0.067 μs | -

Excel: DataStructPerformance20140111.xlsx

I found a bug: the first level of the tree was never used, so the actual tree depth was one more than expected. After fixing this, performance improved a little. But I don't know why random access became slower.

Monday, January 13, 2014

Test Item | Total Cycles | Elapsed (ms) | One Cycle | Memory Usage
Sequential Read & Write (1,000 rows × 1,000 columns) | 2,000,000 | 130 | 0.065 μs | 9.6 MB
Sequential Read & Write (100,000 rows × 100 columns) | 20,000,000 | 1297 | 0.06485 μs | 46 MB
Interval Read & Write (10,000 rows × 10,000 columns, interval 100) | 20,000 | 81 | 4.05 μs | 35 MB
Random Read & Write (rows 0-10,000, columns 0-10,000) | 20,000 | 263 | 13.15 μs | 73 MB
Iterate Over Empty Range (1,000,000 rows × 200 columns) | 200,000,000 | 150 | 0.00075 μs | -
Fill 1,000,000 Rows (1,000,000 rows × 200 columns) | 1,000,000 | 110 | 0.11 μs | 145 MB
Sum 1,000,000 Rows (Iterator) (1,000,000 rows × 1 column) | 1,000,000 | 113 | 0.113 μs | -
Sum 1,000,000 Rows (For) (1,000,000 rows × 1 column) | 1,000,000 | 50 | 0.05 μs | -

Excel: DataStructPerformance20140113.xlsx

Last edited Jan 14, 2014 at 1:44 PM by unvell, version 39