Batch upsert documents by levkk · Pull Request #1539 · postgresml/postgresml

People in the Python world are used to batching systems already built into the dataset libraries.

Datasets are only one of many sources of data. For example, the use case that triggered the desire for this feature was streaming WET files from a warcio.archiveiterator.ArchiveIterator, which seemingly doesn't have batching support built in. This is not uncommon for non-machine-learning libraries and toolkits that people use to build regular web apps. Is it easy to write the batching logic yourself? Seemingly so, but it's really easy to forget to flush the last, often incomplete, batch when the source stream ends, especially when you have to do it yourself, over and over, for each use case in your code.
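To illustrate the pitfall, here is a minimal sketch of hand-rolled batching over a streaming source. The helper name `upsert_in_batches` is illustrative, not part of any SDK; only `collection.upsert_documents` is assumed from the library:

```python
def upsert_in_batches(collection, records, batch_size=100):
    """Buffer records and upsert them in fixed-size batches."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            collection.upsert_documents(batch)
            batch = []
    # The step that is easy to forget: the final batch is usually
    # incomplete (len(records) % batch_size != 0) but still has to
    # be written once the source stream is exhausted.
    if batch:
        collection.upsert_documents(batch)
```

Every call site that streams from a different source has to repeat this trailing flush, which is exactly the boilerplate this PR moves into the collection.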

Why not use the batch_size argument on Collection.upsert_documents for this functionality?

batch_size doesn't handle the incomplete-batch scenario, where a user inserts a number of records such that len(records) % batch_size != 0, hence the need for finish() aka flush(). You have to tell the collection when you're done writing and no more records will be added to whatever incomplete batch it has been buffering.
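The buffering pattern behind finish()/flush() can be sketched as follows. This is a hedged illustration of the semantics, not the pgml SDK's actual implementation; the `BatchWriter` class and its method names are hypothetical, and only `collection.upsert_documents` is assumed:

```python
class BatchWriter:
    """Illustrative buffer that upserts documents in batches and
    flushes the incomplete remainder on finish()."""

    def __init__(self, collection, batch_size):
        self.collection = collection
        self.batch_size = batch_size
        self.buffer = []

    def upsert(self, document):
        self.buffer.append(document)
        if len(self.buffer) >= self.batch_size:
            self._flush()

    def finish(self):
        # The caller signals that no more records are coming,
        # so whatever incomplete batch remains gets written.
        if self.buffer:
            self._flush()

    def _flush(self):
        self.collection.upsert_documents(list(self.buffer))
        self.buffer = []
```

The key point is that the writer cannot know on its own that the stream has ended; without an explicit finish(), the trailing partial batch would sit in the buffer forever.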