Skip to content

merge J2doll/sax reader #428

Closed
j2doll wants to merge 7 commits intomasterfrom
j2doll/sax_reader
Closed

merge J2doll/sax reader #428
j2doll wants to merge 7 commits intomasterfrom
j2doll/sax_reader

Conversation

@j2doll
Copy link
Member

@j2doll j2doll commented Dec 24, 2025

  • Added the ability to read xlsx using xml sax reader.
    • Use less memory compared to DOM structure.

- QXlsx::Document::read_sheet_sax() 
- The read_sheet_sax() function reads the data of a specified sheet using the SAX (Streaming API for XML) method. This function is designed to process each cell's data row by row via a callback, minimizing memory usage and enabling efficient handling of large Excel files. Key options include resolving sharedStrings, outputting formulas as text, and handling empty sheet data.
modifying markdown correcting wrong link
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds SAX (Simple API for XML) reader functionality to QXlsx, providing a memory-efficient alternative to the existing DOM-based approach for reading Excel files. The SAX reader allows streaming cell data through callbacks without loading the entire document structure into memory.

Key Changes:

  • Implemented SAX-based XML parser for reading worksheet cells with lower memory footprint
  • Added Qt 6 compatibility updates for QDateTime API changes
  • Updated documentation and removed obsolete language-specific README references

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
QXlsx/source/xlsxreadsax.cpp New SAX reader implementation with XML parsing logic
QXlsx/header/xlsxreadsax.h Public API definitions for SAX reader options and structures
QXlsx/source/xlsxdocument.cpp Added read_sheet_sax methods to Document class
QXlsx/header/xlsxdocument.h Exposed SAX reader API in Document class
QXlsx/header/xlsxdocument_p.h Added package_bytes storage for SAX reopening
TestExcel/extractdata_sax.cpp Example/test code demonstrating SAX reader usage
TestExcel/main.cpp Integration of new SAX test function
TestExcel/hello.cpp Qt 6 compatibility fix for QDateTime API
TestExcel/readStyle.cpp Initialized return variable
QXlsx/QXlsx.pro, QXlsx/QXlsx.pri, QXlsx/CMakeLists.txt Build configuration updates
TestExcel/TestExcel.pro Added extractdata_sax.cpp to build
README.md, README.*.md Updated documentation links
Example.md Added SAX reader example documentation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


sax_cell c;
c.row = row;
c.col = col;
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sax_cell structure includes a sheet_name field that is never populated in the read_sheet_xml_sax function. The callback receives cells with an empty sheet_name even though the example code in dump_all_sheets_sax expects this field to be populated and uses it in the output.

Suggested change
c.col = col;
c.col = col;
c.sheet_name = sheet_name;

Copilot uses AI. Check for mistakes.
QStringList load_shared_strings_all(ZipReader& zip);

// Parse sheet.xml with SAX
bool read_sheet_xml_sax(const QByteArray& sheet_xml,
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The read_sheet_xml_sax function does not receive the sheet name as a parameter, making it impossible to populate the sax_cell.sheet_name field. The function signature should include the sheet name so that the callback can provide complete cell information.

Suggested change
bool read_sheet_xml_sax(const QByteArray& sheet_xml,
bool read_sheet_xml_sax(const QByteArray& sheet_xml,
const QString& sheet_name,

Copilot uses AI. Check for mistakes.
Comment on lines +91 to +106
QString cell_s;
QString v_text;

while (!rd.atEnd()) {
rd.readNext();

if (rd.isStartElement()) {
const auto name = rd.name();

if (name == QLatin1String("sheetData")) {
in_sheetdata = true;
} else if (in_sheetdata && name == QLatin1String("c")) {
in_c = true;
cell_r = rd.attributes().value(QLatin1String("r")).toString();
cell_t = rd.attributes().value(QLatin1String("t")).toString();
cell_s = rd.attributes().value(QLatin1String("s")).toString();
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable cell_s is declared and assigned but never used. The 's' attribute in Excel XML typically contains style information. Either this variable should be removed or the style information should be included in the sax_cell structure.

Copilot uses AI. Check for mistakes.
Comment on lines +87 to +121
bool in_v = false;

QString cell_r;
QString cell_t;
QString cell_s;
QString v_text;

while (!rd.atEnd()) {
rd.readNext();

if (rd.isStartElement()) {
const auto name = rd.name();

if (name == QLatin1String("sheetData")) {
in_sheetdata = true;
} else if (in_sheetdata && name == QLatin1String("c")) {
in_c = true;
cell_r = rd.attributes().value(QLatin1String("r")).toString();
cell_t = rd.attributes().value(QLatin1String("t")).toString();
cell_s = rd.attributes().value(QLatin1String("s")).toString();
v_text.clear();
} else if (in_c && name == QLatin1String("v")) {
in_v = true;
v_text = rd.readElementText(QXmlStreamReader::IncludeChildElements);
in_v = false;
} else if (in_c && name == QLatin1String("f")) {
const QString f = rd.readElementText(QXmlStreamReader::IncludeChildElements);
if (opt.read_formulas_as_text) {
cell_t = QLatin1String("str");
v_text = f;
}
} else if (in_c && name == QLatin1String("t") && cell_t == QLatin1String("inlineStr")) {
in_v = true;
v_text = rd.readElementText(QXmlStreamReader::IncludeChildElements);
in_v = false;
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The in_v variable is set to true and then immediately set to false on lines 109-111 and 119-121, making it effectively unused. This variable should be removed as it serves no purpose.

Copilot uses AI. Check for mistakes.
return QString::fromLatin1(buf);
}

void create_xlsx(QString xlsx_path)
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter should be passed by const reference (const QString&) rather than by value to avoid unnecessary copying of the QString object.

Copilot uses AI. Check for mistakes.
Comment on lines +52 to +68
QXmlStreamReader rd(xml);

bool in_si = false;
QString acc;

while (!rd.atEnd()) {
rd.readNext();
if (rd.isStartElement()) {
const auto name = rd.name();
if (name == QLatin1String("si")) {
in_si = true;
acc.clear();
} else if (in_si && name == QLatin1String("t")) {
acc += rd.readElementText(QXmlStreamReader::IncludeChildElements);
}
} else if (rd.isEndElement()) {
const auto name = rd.name();
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name "rd" is abbreviated and unclear. Consider using a more descriptive name like "reader" or "xmlReader" to improve code readability.

Suggested change
QXmlStreamReader rd(xml);
bool in_si = false;
QString acc;
while (!rd.atEnd()) {
rd.readNext();
if (rd.isStartElement()) {
const auto name = rd.name();
if (name == QLatin1String("si")) {
in_si = true;
acc.clear();
} else if (in_si && name == QLatin1String("t")) {
acc += rd.readElementText(QXmlStreamReader::IncludeChildElements);
}
} else if (rd.isEndElement()) {
const auto name = rd.name();
QXmlStreamReader xmlReader(xml);
bool in_si = false;
QString acc;
while (!xmlReader.atEnd()) {
xmlReader.readNext();
if (xmlReader.isStartElement()) {
const auto name = xmlReader.name();
if (name == QLatin1String("si")) {
in_si = true;
acc.clear();
} else if (in_si && name == QLatin1String("t")) {
acc += xmlReader.readElementText(QXmlStreamReader::IncludeChildElements);
}
} else if (xmlReader.isEndElement()) {
const auto name = xmlReader.name();

Copilot uses AI. Check for mistakes.
Comment on lines +33 to +34
QString saveAsXslxName = QString::fromLocal8Bit("sample2.xlsx");
doc.saveAs(saveAsXslxName);
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name contains a typo: "saveAsXslxName" should be "saveAsXlsxName" (lowercase 's' in xlsx).

Suggested change
QString saveAsXslxName = QString::fromLocal8Bit("sample2.xlsx");
doc.saveAs(saveAsXslxName);
QString saveAsXlsxName = QString::fromLocal8Bit("sample2.xlsx");
doc.saveAs(saveAsXlsxName);

Copilot uses AI. Check for mistakes.
Comment on lines +10 to +14

struct sax_options
{
bool resolve_shared_strings = true;
bool read_formulas_as_text = false;
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sax_options structure is not documented. Public API structures should include documentation comments explaining the purpose and usage of each field.

Suggested change
struct sax_options
{
bool resolve_shared_strings = true;
bool read_formulas_as_text = false;
/**
* @brief Options controlling SAX-based reading of worksheet XML.
*
* All fields have sensible defaults so that callers can use the
* default-constructed instance for common cases and override only
* the options they care about.
*/
struct sax_options
{
/**
* @brief Whether to resolve shared-string references to their text.
*
* When true, cells that reference entries in sharedStrings.xml are
* returned with their corresponding string value. When false, the
* raw shared-string index or reference is preserved instead.
*/
bool resolve_shared_strings = true;
/**
* @brief Whether to read formulas as plain text instead of evaluating.
*
* When true, cell formulas are exposed as their textual representation
* (e.g. "=A1+B1") rather than any cached or computed result value.
*/
bool read_formulas_as_text = false;
/**
* @brief Whether to stop parsing when encountering an empty sheet data.
*
* When true, the reader may stop processing early if the worksheet
* contains an empty sheetData section, which can be used as a simple
* optimization for fully empty sheets.
*/

Copilot uses AI. Check for mistakes.
Comment on lines +138 to +142
// New feature: Stream sheet cells with callback "without saving"
bool read_sheet_sax(const QString& sheet_name,
const sax_options& opt,
const sax_cell_callback& on_cell);

Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The public API functions read_sheet_sax are not documented. Public API methods should include documentation comments explaining parameters, return values, and behavior.

Suggested change
// New feature: Stream sheet cells with callback "without saving"
bool read_sheet_sax(const QString& sheet_name,
const sax_options& opt,
const sax_cell_callback& on_cell);
// New feature: Stream sheet cells with callback "without saving"
/**
* @brief Reads cells from a worksheet using a SAX-style parser.
*
* These overloads stream cell data from a worksheet without modifying or
* saving the document. For each parsed cell, the provided callback
* (@p on_cell) is invoked with information about that cell. This is
* useful for processing large worksheets without loading the entire
* sheet into memory.
*
* The @p opt parameter controls parsing options such as which parts of
* the worksheet to include or skip during streaming.
*
* @param sheet_name Name of the worksheet to read.
* @param opt Options that configure the SAX reading behavior.
* @param on_cell Callback invoked for each cell encountered while
* streaming the worksheet.
*
* @return @c true if the worksheet was found and read successfully;
* otherwise @c false (for example, if the sheet does not exist
* or an error occurs while reading).
*/
bool read_sheet_sax(const QString& sheet_name,
const sax_options& opt,
const sax_cell_callback& on_cell);
/**
* @brief Reads cells from a worksheet, selected by index, using a
* SAX-style parser.
*
* This overload behaves like the name-based version but selects the
* worksheet by its zero-based index in the workbook.
*
* @param sheet_index Zero-based index of the worksheet to read.
* @param opt Options that configure the SAX reading behavior.
* @param on_cell Callback invoked for each cell encountered while
* streaming the worksheet.
*
* @return @c true if the worksheet was found and read successfully;
* otherwise @c false.
*/

Copilot uses AI. Check for mistakes.
// Parse sheet.xml with SAX
bool read_sheet_xml_sax(const QByteArray& sheet_xml,
const sax_options& opt,
const QStringList* shared_strings, // nullptr 가능
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment contains Korean text ("nullptr 가능" meaning "nullptr possible"). Comments in code should use English for consistency and maintainability.

Suggested change
const QStringList* shared_strings, // nullptr 가능
const QStringList* shared_strings, // can be nullptr

Copilot uses AI. Check for mistakes.
@j2doll j2doll closed this Dec 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants