Sometimes you need to work with
.xlx files. In the event that they’re larger than 1M rows, excel will either load the first 1M rows or crash. This article will demonstrate how to load such files in python.
Background Info on Excel File Structure
Before we dive in, here’s some background information on how spreadsheets work on the backend.
I. Data Structure
Spreadsheets store data as row-wise arrays consisting of column keys with cell values.¹
The functionality of spreadsheets deride from their behavior as a directed acyclic graph (DAG) which allow cyclical (arbitrary) relational functionality.² This is best demonstrated with the illustration below which shows how some cell values are derived from their relationships to others.
II. Size Limits
Current size limits for excel are 1,048,576 rows by 16,384 columns — owing to memory resources. To be clear, data can be stored in an excel file which breaks these rules — but will not function in the excel program.
If you were to open an excel file of 1.25M rows, the program will drop all data below row 1M.
III. Q: If excel is so limited, why use it?
The easiest answer is that the file itself is easily shareable and provides better encryption than csv, txt, or json structure. Also, excel’s UI is incredibly user friendly and makes inspecting data really easy.
Some specific situations where I’ve received massive excel files:
- External access not-allowed to database — existing user exports data to encrypted excel format and shares.
- Data is being collected/generated by non-technical team. (Ex. agile marketing project.)
- Data needs to be inspected by legal /…