Vince's CSV Parser
Motivation
There are plenty of other CSV parsers in the wild, but I had a hard time finding what I wanted. Inspired by Python's csv module, I wanted a library with simple, intuitive syntax. Furthermore, I wanted support for special use cases such as calculating statistics on very large files. Thus, this library was created with the following goals in mind.
Performance and Memory Requirements
A high performance CSV parser allows you to take advantage of the deluge of large datasets available. By using overlapped threads, memory mapped IO, and minimal memory allocation, this parser can quickly tackle large CSV files--even if they are larger than RAM.
In fact, according to Visual Studio's profiler, this CSV parser spends almost 90% of its CPU cycles actually reading your data, as opposed to getting hung up on hard disk I/O or pushing memory around.
Show me the numbers
On my computer (12th Gen Intel(R) Core(TM) i5-12400 @ 2.50 GHz/Western Digital Blue 5400RPM HDD), this parser can read
- the 69.9 MB 2015_StateDepartment.csv in 0.19 seconds (360 MBps)
- a 1.4 GB Craigslist Used Vehicles Dataset in 1.18 seconds (1.2 GBps)
- a 2.9 GB Car Accidents Dataset in 8.49 seconds (352 MBps)
Chunk Size Tuning
By default, the parser reads CSV data in 10MB chunks. This balance was determined through empirical testing to optimize throughput while minimizing memory overhead and thread synchronization costs.
If you encounter rows larger than the chunk size, pass a custom CSVFormat with chunk_size():
```cpp
CSVFormat fmt;
fmt.chunk_size(100 * 1024 * 1024); // 100MB chunks

CSVReader reader("massive_rows.csv", fmt);
for (auto& row : reader) {
    // Process row
}
```
Tuning guidance: The default 10MB provides good balance for typical workloads. Smaller chunks (e.g., 500KB) increase thread overhead without meaningful memory savings. Larger chunks (e.g., 100MB+) reduce thread coordination overhead but consume more memory and delay the first row. Feel free to experiment and measure with your own hardware and data patterns.
Robust Yet Flexible
RFC 4180 and Beyond
This CSV parser is much more than a fancy string splitter, and parses all files following RFC 4180.
However, in reality we know that RFC 4180 is just a suggestion, and there are many "flavors" of CSV, such as tab-delimited files. Thus, this library has:
- Automatic delimiter guessing
- Ability to ignore comments in leading rows and elsewhere
- Ability to handle rows of different lengths
- Ability to handle arbitrary line endings (as long as they are some combination of carriage return and newline)
By default, rows of variable length are silently ignored, although you may elect to keep them or throw an error.
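For intuition, automatic delimiter guessing can be pictured as a consistency check: a good candidate delimiter should appear the same non-zero number of times on every sample row. Below is a minimal sketch of that idea — illustrative only, not this library's actual algorithm (the library guesses for you; you never need to call anything like this yourself):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sketch: pick the candidate delimiter whose per-line count is non-zero
// and identical across all sample lines. Illustrative only.
char guess_delimiter(const std::vector<std::string>& lines,
                     const std::vector<char>& candidates = { ',', '\t', ';', '|' }) {
    for (char cand : candidates) {
        size_t expected = std::count(lines[0].begin(), lines[0].end(), cand);
        if (expected == 0) continue;

        bool consistent = true;
        for (const auto& line : lines) {
            if ((size_t)std::count(line.begin(), line.end(), cand) != expected) {
                consistent = false;
                break;
            }
        }
        if (consistent) return cand;
    }
    return ','; // Fall back to comma
}
```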
Encoding
This CSV parser is encoding-agnostic and will handle ANSI and UTF-8 encoded files. It does not try to decode UTF-8, except for detecting and stripping UTF-8 byte order marks.
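The BOM handling mentioned above amounts to checking for the three-byte UTF-8 mark (0xEF 0xBB 0xBF) at the start of the input and dropping it. A minimal sketch of that check (not the library's actual code):

```cpp
#include <string>

// Detect and strip a UTF-8 byte order mark (0xEF 0xBB 0xBF) from the start
// of the input, as described above. Illustrative sketch only.
std::string strip_utf8_bom(const std::string& data) {
    if (data.size() >= 3 &&
        static_cast<unsigned char>(data[0]) == 0xEF &&
        static_cast<unsigned char>(data[1]) == 0xBB &&
        static_cast<unsigned char>(data[2]) == 0xBF) {
        return data.substr(3);
    }
    return data;
}
```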
Well Tested
This CSV parser has:
- An extensive Catch2 test suite
- Tests of various CMake and non-CMake builds across g++, clang, MSVC, and MinGW
- Address, thread safety, and undefined behavior checks with ASan, TSan, and Valgrind (see GitHub Actions)
Bug Reports
Found a bug? Please report it! This project welcomes genuine bug reports brought in good faith:
- ✅ Crashes, memory leaks, data corruption, race conditions
- ✅ Incorrect parsing of valid CSV files
- ✅ Performance regressions in real-world scenarios
- ✅ API issues that affect practical, real-world use cases
When reporting integration or compiler issues, please state which library form you are using:
- Single-header
- Unamalgamated headers/library (include/ with your own build system, CMake, etc.)
Please keep reports grounded in real use cases—no contrived edge cases or philosophical debates about API design, thanks!
Design Note: CSVReader uses std::input_iterator_tag for single-pass streaming of arbitrarily large files. If you need multi-pass iteration or random access, copy rows to a std::vector first. This is by design, not a bug.
Documentation
In addition to the Features & Examples below, the fully-fledged online documentation contains more examples, details, interesting features, and instructions for less common use cases.
Sponsors
If you use this library for work, please become a sponsor. Your donation will fund continued maintenance and development of the project.
Shameless plug: If you like this library, check out my side project experiencer — a WYSIWYG resume editor with clean HTML/CSS output.
Integration
This library was developed with Microsoft Visual Studio and is compatible with g++ 7.5 and newer, as well as clang.
All of the code required to build this library, aside from the C++ standard library, is contained under include/.
C++ Version
While C++17 is recommended, C++11 is the minimum version required. This library makes extensive use of string views, and uses Martin Moene's string view library if std::string_view is not available.
This library requires C++ exceptions to be enabled (for example, do not compile with -fno-exceptions).
Single Header
📥 Download csv.hpp — Available on GitHub Pages
Or copy the URL:
https://vincentlaucsb.github.io/csv-parser/csv.hpp
The file is automatically generated and deployed on every commit to master, ensuring you always have the latest version.
CMake Instructions
If you're including this in another CMake project, you can simply clone this repo into your project directory, and add the following to your CMakeLists.txt:
```cmake
# Optional: Defaults to C++ 17
# set(CSV_CXX_STANDARD 11)
add_subdirectory(csv-parser)

# ...

add_executable(<your program> ...)
target_link_libraries(<your program> csv)
```
Avoiding a Clone with FetchContent
Don't want to clone? No problem. There's also a simple example documenting how to use CMake's FetchContent module to integrate this library.
Features & Examples
Reading an Arbitrarily Large File (with Iterators)
With this library, you can easily stream over a large file without reading its entirety into memory.
C++ Style
```cpp
#include "csv.hpp"

using namespace csv;

...

CSVReader reader("very_big_file.csv");

for (CSVRow& row: reader) { // Input iterator
    for (CSVField& field: row) {
        // By default, get<>() produces a std::string.
        // A more efficient get<string_view>() is also available, where the resulting
        // string_view is valid as long as the parent CSVRow is alive
        std::cout << field.get<>() << ...
    }
}

...
```
Old-Fashioned C Style Loop
```cpp
...

CSVReader reader("very_big_file.csv");
CSVRow row;

while (reader.read_row(row)) {
    // Do stuff with row here
}

...
```
Memory-Mapped Files vs. Streams
By default, passing in a file path string to the constructor of CSVReader causes memory-mapped IO to be used. In general, this option is the most performant.
However, std::ifstream may also be used, as may in-memory sources via std::stringstream.
Note: Currently CSV guessing only works for memory-mapped files. The CSV dialect must be manually defined for other sources.
⚠️ IMPORTANT - Iterator Type and Memory Safety:
CSVReader::iterator is an input iterator (std::input_iterator_tag), NOT a forward iterator.
This design enables streaming large CSV files (50+ GB) without loading them entirely into memory.
Why Forward Iterator Algorithms Don't Work:
- As the iterator advances, underlying data chunks are automatically freed to bound memory usage
- Algorithms like std::max_element require ForwardIterator semantics (multi-pass, holding multiple positions)
- Using such algorithms directly on CSVReader::iterator will cause heap-use-after-free when the algorithm tries to access iterators pointing to already-freed data chunks
- While this may appear to work with small files that fit in a single chunk, it WILL fail with larger files
✅ Correct Approach for ForwardIterator Algorithms:
```cpp
// Copy rows to vector first (enables multi-pass iteration)
CSVReader reader("large_file.csv");
std::vector<CSVRow> rows(reader.begin(), reader.end());

// Now safely use any algorithm requiring ForwardIterator
auto max_row = std::max_element(rows.begin(), rows.end(),
    [](const CSVRow& a, const CSVRow& b) {
        return a["salary"].get<double>() < b["salary"].get<double>();
    });
```
```cpp
CSVFormat format; // custom formatting options go here

CSVReader mmap("some_file.csv", format);

std::ifstream infile("some_file.csv", std::ios::binary);
CSVReader ifstream_reader(infile, format);

std::stringstream my_csv;
CSVReader sstream_reader(my_csv, format);
```
Indexing by Column Names
Retrieving values using a column name string is a cheap, constant time operation.
```cpp
#include "csv.hpp"

using namespace csv;

...

CSVReader reader("very_big_file.csv");
double sum = 0;

for (auto& row: reader) {
    // Note: Can also use index of column with [] operator
    sum += row["Total Salary"].get<double>();
}

...
```
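Why is this constant time? Conceptually, the header row is parsed once into a name-to-index hash map that every row then shares. A sketch of the idea (illustrative — not the library's actual internals):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Sketch: the header is parsed once into a name -> index map, so each
// row can resolve a column name in O(1) on average. Illustrative only.
struct ColumnIndex {
    std::unordered_map<std::string, size_t> positions;

    explicit ColumnIndex(const std::vector<std::string>& header) {
        for (size_t i = 0; i < header.size(); ++i)
            positions[header[i]] = i;
    }

    size_t at(const std::string& name) const { return positions.at(name); }
};
```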
Numeric Conversions
If your CSV has lots of numeric values, you can also have this parser (lazily) convert them to the proper data type.
- try_get<T>() is a non-throwing version of get<T>() which returns a bool indicating whether the conversion was successful
- Type checking is performed on conversions to prevent undefined behavior and integer overflow
- Negative numbers cannot be blindly converted to unsigned integer types
- get<float>(), get<double>(), and get<long double>() are capable of parsing numbers written in scientific notation
- Note: Conversions to floating point types are not currently checked for loss of precision
```cpp
#include "csv.hpp"

using namespace csv;

...

CSVReader reader("very_big_file.csv");

for (auto& row: reader) {
    int timestamp = 0;

    if (row["timestamp"].try_get(timestamp)) {
        // Non-throwing conversion
        std::cout << "Timestamp: " << timestamp << std::endl;
    }

    if (row["timestamp"].is_int()) {
        // Can use get<>() with any integer type, but negative
        // numbers cannot be converted to unsigned types
        row["timestamp"].get<int>();

        // You can also attempt to parse hex values
        long long value;
        if (row["hexValue"].try_parse_hex(value)) {
            std::cout << "Hex value is " << value << std::endl;
        }

        // Or specify a different integer type
        int smallValue;
        if (row["smallHex"].try_parse_hex<int>(smallValue)) {
            std::cout << "Small hex value is " << smallValue << std::endl;
        }

        // Decimal numbers using a comma separator can be handled this way
        long double decimalValue;
        if (row["decimalNumber"].try_parse_decimal(decimalValue, ',')) {
            std::cout << "Decimal value is " << decimalValue << std::endl;
        }

        // ..
    }
}
```
Converting to JSON
You can serialize individual rows as JSON objects, where the keys are column names, or as JSON arrays (which don't contain column names). The outputted JSON contains properly escaped strings with minimal whitespace and no quoting for numeric values. How these JSON fragments are assembled into a larger JSON document is an exercise left for the user.
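For intuition, the "properly escaped strings" mentioned above boil down to replacing JSON's special characters before emitting a value. A simplified, illustrative escaper (not the library's implementation) might look like:

```cpp
#include <string>

// Simplified JSON string escaping: quotes, backslashes, and common
// whitespace control characters. Illustrative only — not the library's code.
std::string json_escape(const std::string& s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:   out += c;
        }
    }
    return out;
}
```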
```cpp
#include <sstream>
#include "csv.hpp"

using namespace csv;

...

CSVReader reader("very_big_file.csv");
std::stringstream my_json;

for (auto& row: reader) {
    my_json << row.to_json() << std::endl;
    my_json << row.to_json_array() << std::endl;

    // You can pass in a vector of column names to
    // slice or rearrange the outputted JSON
    my_json << row.to_json({ "A", "B", "C" }) << std::endl;
    my_json << row.to_json_array({ "C", "B", "A" }) << std::endl;
}
```
Specifying the CSV Format
Although the CSV parser has a decent guessing mechanism, in some cases it is preferable to specify the exact parameters of a file.
```cpp
#include "csv.hpp"
#include ...

using namespace csv;

CSVFormat format;
format.delimiter('\t')
      .quote('~')
      .header_row(2);   // Header is on 3rd row (zero-indexed)
      // .no_header();  // Parse CSVs without a header row
      // .quote(false); // Turn off quoting

// Alternatively, we can use format.delimiter({ '\t', ',', ... })
// to tell the CSV guesser which delimiters to try out

CSVReader reader("weird_csv_dialect.csv", format);

for (auto& row: reader) {
    // Do stuff with rows here
}
```
Trimming Whitespace
This parser can efficiently trim off leading and trailing whitespace. Of course, make sure you don't include your intended delimiter or newlines in the list of characters to trim.
```cpp
CSVFormat format;
format.trim({ ' ', '\t' });
```

Handling Variable Numbers of Columns
Sometimes, the rows in a CSV are not all of the same length. Whether this was intentional or not, this library is built to handle all use cases.
```cpp
CSVFormat format;

// Default: Silently ignoring rows with missing or extraneous columns
format.variable_columns(false); // Short-hand
format.variable_columns(VariableColumnPolicy::IGNORE_ROW);

// Case 2: Keeping variable-length rows
format.variable_columns(true); // Short-hand
format.variable_columns(VariableColumnPolicy::KEEP);

// Case 3: Throwing an error if variable-length rows are encountered
format.variable_columns(VariableColumnPolicy::THROW);
```
Setting Column Names
If a CSV file does not have column names, you can specify your own:
```cpp
std::vector<std::string> col_names = { ... };

CSVFormat format;
format.column_names(col_names);
```

Parsing an In-Memory String
```cpp
#include "csv.hpp"

using namespace csv;

...

// Method 1: Using parse()
std::string csv_string = "Actor,Character\r\n"
    "Will Ferrell,Ricky Bobby\r\n"
    "John C. Reilly,Cal Naughton Jr.\r\n"
    "Sacha Baron Cohen,Jean Giard\r\n";

auto rows = parse(csv_string);
for (auto& r: rows) {
    // Do stuff with row here
}

// Method 2: Using the _csv operator
auto more_rows = "Actor,Character\r\n"
    "Will Ferrell,Ricky Bobby\r\n"
    "John C. Reilly,Cal Naughton Jr.\r\n"
    "Sacha Baron Cohen,Jean Giard\r\n"_csv;

for (auto& r: more_rows) {
    // Do stuff with row here
}
```
DataFrames for Random Access and Updates
For files that fit comfortably in memory, DataFrame provides fast and powerful keyed access, in-place updates, and grouping operations—all built on the same high-performance parser. It uses the same parsing pipeline as CSVReader but retains the results in memory for random access.
Creating a DataFrame with Keyed Access
```cpp
#include "csv.hpp"

using namespace csv;

...

// Shortest form: pass a filename directly with DataFrameOptions
DataFrame<int> df("employees.csv",
                  DataFrameOptions().set_key_column("employee_id"));

// Or construct from an existing CSVReader (e.g. when you need a custom format)
CSVReader reader("employees.csv");
DataFrame<int> df2(reader, "employee_id");

// O(1) lookups by key
auto salary = df[12345]["salary"].get<double>();

// Positional access: operator[](size_t) is disabled when KeyType is an integer
// type to prevent ambiguity with operator[](const KeyType&). Use iloc() instead.
auto first_row = df.iloc(0);
auto name = first_row["name"].get<std::string>();

// Check if a key exists
if (df.contains(99999)) {
    std::cout << "Employee exists" << std::endl;
}
```
Using DataFrameOptions for Fine-Grained Control
```cpp
// Configure key column, duplicate-key policy, and missing-key behaviour
DataFrameOptions opts;
opts.set_key_column("employee_id")
    .set_duplicate_key_policy(
        DataFrameOptions::DuplicateKeyPolicy::KEEP_FIRST) // or OVERWRITE / THROW
    .set_throw_on_missing_key(false); // silently skip rows with no key value

DataFrame<int> df("employees.csv", opts);
```
Creating a DataFrame with a Custom Key Function
```cpp
CSVReader reader("employees.csv");

// Build a composite key from two columns
auto make_key = [](const CSVRow& row) {
    return row["first_name"].get<std::string>() + "_" +
           row["last_name"].get<std::string>();
};

DataFrame<std::string> by_name(reader, make_key);

// Lookups by composite key
auto employee = by_name["Ada_Lovelace"]["department"].get<std::string>();
```
Updating Values
```cpp
// Updates are stored in an efficient overlay without copying the entire dataset
df.set(12345, "salary", "95000");
df.set(67890, "department", "Engineering");

// Access methods return updated values transparently
std::cout << df[12345]["salary"].get<std::string>(); // "95000"

// Iterate with edits visible
for (auto& row : df) {
    std::cout << row["salary"].get<std::string>(); // Shows edited values
}
```
Grouping and Analysis
```cpp
// Group by department
auto groups = df.group_by("department");

for (auto& [dept, row_indices] : groups) {
    double total_salary = 0;
    for (size_t i : row_indices) {
        // Positional access via iloc(), since operator[](size_t) is
        // disabled for integer key types (see above)
        total_salary += df.iloc(i)["salary"].get<double>();
    }
    std::cout << dept << " total: $" << total_salary << std::endl;
}

// Group using a custom function
auto by_salary_range = df.group_by([](const CSVRow& row) {
    double salary = row["salary"].get<double>();
    return salary < 50000  ? "junior"
         : salary < 100000 ? "mid"
         : "senior";
});
```
Writing Back to CSV
Each DataFrameRow has an implicit conversion to std::vector<std::string>, which is convenient when using CSVWriter.
```cpp
// DataFrameRow has implicit conversion for CSVWriter compatibility
auto writer = make_csv_writer(std::cout);

for (auto& row : df) {
    writer << row; // Outputs edited values
}
```
When to Use DataFrame vs. CSVReader:
- Use CSVReader for: Large files (>1GB), streaming pipelines, minimal memory footprint
- Use DataFrame for: Files that fit in RAM, frequent lookups/updates, grouping operations, data that needs random access
When Not to Use DataFrame:
- Extremely large files that do not fit in RAM
- Streaming pipelines where you only need single-pass access
Both options deliver the same parsing performance—DataFrame simply keeps the results in memory for convenience.
Writing CSV Files
```cpp
#include "csv.hpp"
#include ...

using namespace csv;
using namespace std;

...

stringstream ss; // Can also use ofstream, etc.

auto writer = make_csv_writer(ss);
// auto writer = make_tsv_writer(ss);              // For tab-separated files
// DelimWriter<stringstream, '|', '"'> writer(ss); // Your own custom format
// set_decimal_places(2); // How many places after the decimal will be written for floats

writer << vector<string>({ "A", "B", "C" })
       << deque<string>({ "I'm", "too", "tired" })
       << list<string>({ "to", "write", "documentation." });

writer << array<string, 3>({ "The quick brown", "fox", "jumps over the lazy dog" });
writer << make_tuple(1, 2.0, "Three");

...
```
You can pass in arbitrary types into DelimWriter by defining a conversion function for that type to std::string.
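For example, a custom struct can be made writable by giving it a conversion to std::string. The sketch below assumes the writer accepts any type convertible to std::string; the library's exact customization point may differ, so treat this as illustrative:

```cpp
#include <string>

// Hypothetical user-defined type. The operator std::string() conversion is
// what would let a DelimWriter serialize it. Sketch only — the library's
// exact customization point may differ.
struct Point {
    int x, y;

    operator std::string() const {
        return "(" + std::to_string(x) + ";" + std::to_string(y) + ")";
    }
};

// With the library, this could then be written directly, e.g.:
//   auto writer = make_csv_writer(std::cout);
//   writer << std::vector<Point>{ {1, 2}, {3, 4} };
```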