Sometimes you need to work with .xls
or .xlx
files. In the event that they’re larger than 1M rows, excel will either load the first 1M rows or crash. This article will demonstrate how to load such files in python.
Background Info on Excel File Structure
Before we dive in, here’s some background information on how spreadsheets work on the backend.
I. Data Structure
Spreadsheets store data as row-wise arrays consisting of column keys with cell values.¹
The functionality of spreadsheets deride from their behavior as a directed acyclic graph (DAG) which allow cyclical (arbitrary) relational functionality.² This is best demonstrated with the illustration below which shows how some cell values are derived from their relationships to others.
II. Size Limits
Current size limits for excel are 1,048,576 rows by 16,384 columns — owing to memory resources. To be clear, data can be stored in an excel file which breaks these rules — but will not function in the excel program.