METADATA 62 KB

  1. Metadata-Version: 2.1
  2. Name: zstandard
  3. Version: 0.14.1
  4. Summary: Zstandard bindings for Python
  5. Home-page: https://github.com/indygreg/python-zstandard
  6. Author: Gregory Szorc
  7. Author-email: gregory.szorc@gmail.com
  8. License: BSD
  9. Keywords: zstandard zstd compression
  10. Platform: UNKNOWN
  11. Classifier: Development Status :: 4 - Beta
  12. Classifier: Intended Audience :: Developers
  13. Classifier: License :: OSI Approved :: BSD License
  14. Classifier: Programming Language :: C
  15. Classifier: Programming Language :: Python :: 2.7
  16. Classifier: Programming Language :: Python :: 3.5
  17. Classifier: Programming Language :: Python :: 3.6
  18. Classifier: Programming Language :: Python :: 3.7
  19. Classifier: Programming Language :: Python :: 3.8
  20. ================
  21. python-zstandard
  22. ================
  23. This project provides Python bindings for interfacing with the
  24. `Zstandard <http://www.zstd.net>`_ compression library. A C extension
  25. and CFFI interface are provided.
  26. The primary goal of the project is to provide a rich interface to the
  27. underlying C API through a Pythonic interface while not sacrificing
  28. performance. This means exposing most of the features and flexibility
  29. of the C API while not sacrificing usability or safety that Python provides.
  30. The canonical home for this project lives in a Mercurial repository run by
  31. the author. For convenience, that repository is frequently synchronized to
  32. https://github.com/indygreg/python-zstandard.
  33. | |ci-status|
  34. Requirements
  35. ============
  36. This extension is designed to run with Python 2.7, 3.5, 3.6, 3.7, and 3.8
  37. on common platforms (Linux, Windows, and macOS). On PyPy (both PyPy2 and PyPy3) we support version 6.0.0 and above.
  38. x86 and x86_64 are well-tested on Windows. Only x86_64 is well-tested on Linux and macOS.
  39. Installing
  40. ==========
  41. This package is uploaded to PyPI at https://pypi.python.org/pypi/zstandard.
  42. So, to install this package::
  43. $ pip install zstandard
  44. Binary wheels are made available for some platforms. If you need to
  45. install from a source distribution, all you should need is a working C
  46. compiler and the Python development headers/libraries. On many Linux
  47. distributions, you can install a ``python-dev`` or ``python-devel``
  48. package to provide these dependencies.
  49. Packages are also uploaded to Anaconda Cloud at
  50. https://anaconda.org/indygreg/zstandard. See that URL for how to install
  51. this package with ``conda``.
  52. Legacy Format Support
  53. =====================
  54. To enable legacy zstd format support which is needed to handle files compressed
  55. with zstd < 1.0 you need to provide an installation option::
  56. $ pip install zstandard --install-option="--legacy"
  57. and since pip 7.0 it is possible to have the following line in your
  58. requirements.txt::
  59. zstandard --install-option="--legacy"
  60. Performance
  61. ===========
  62. zstandard is a highly tunable compression algorithm. In its default settings
  63. (compression level 3), it will be faster at compression and decompression and
  64. will have better compression ratios than zlib on most data sets. When tuned
  65. for speed, it approaches lz4's speed and ratios. When tuned for compression
  66. ratio, it approaches lzma ratios and compression speed, but decompression
  67. speed is much faster. See the official zstandard documentation for more.
  68. zstandard and this library support multi-threaded compression. There is a
  69. mechanism to compress large inputs using multiple threads.
  70. The performance of this library is usually very similar to what the zstandard
  71. C API can deliver. Overhead in this library is due to general Python overhead
  72. and can't easily be avoided by *any* zstandard Python binding. This library
  73. exposes multiple APIs for performing compression and decompression so callers
  74. can pick an API suitable for their need. Contrast with the compression
  75. modules in Python's standard library (like ``zlib``), which only offer limited
  76. mechanisms for performing operations. The API flexibility means consumers can
  77. choose to use APIs that facilitate zero copying or minimize Python object
  78. creation and garbage collection overhead.
  79. This library is capable of single-threaded throughputs well over 1 GB/s. For
  80. exact numbers, measure yourself. The source code repository has a ``bench.py``
  81. script that can be used to measure things.
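As a starting point for such measurements, here is a minimal timing sketch
for Python 3. It is not the repository's ``bench.py``: the helper name
``throughput_mb_per_s`` and the sample data are illustrative assumptions, and
``zlib`` from the standard library stands in as the baseline compressor.

```python
import time
import zlib


def throughput_mb_per_s(compress, data, repeats=5):
    """Return the best observed throughput of compress(data) in MB/s."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        compress(data)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a coarse clock.
    return len(data) / max(best, 1e-9) / 1e6


# Hypothetical sample input; real measurements should use representative data.
sample = b"zstandard bindings for python\n" * 100000
zlib_rate = throughput_mb_per_s(lambda d: zlib.compress(d, 6), sample)
```

Passing the ``compress`` method of a ``ZstdCompressor`` instance as the first
argument should measure this library under the same conditions.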
  82. API
  83. ===
  84. To interface with Zstandard, simply import the ``zstandard`` module::
  85. import zstandard
  86. It is a popular convention to alias the module as a different name for
  87. brevity::
  88. import zstandard as zstd
  89. This module attempts to import and use either the C extension or CFFI
  90. implementation. On Python platforms known to support C extensions (like
  91. CPython), it raises an ImportError if the C extension cannot be imported.
  92. On Python platforms known to not support C extensions (like PyPy), it only
  93. attempts to import the CFFI implementation and raises ImportError if that
  94. can't be done. On other platforms, it first tries to import the C extension
  95. then falls back to CFFI if that fails and raises ImportError if CFFI fails.
  96. To change the module import behavior, a ``PYTHON_ZSTANDARD_IMPORT_POLICY``
  97. environment variable can be set. The following values are accepted:
  98. default
  99. The behavior described above.
  100. cffi_fallback
  101. Always try to import the C extension then fall back to CFFI if that
  102. fails.
  103. cext
  104. Only attempt to import the C extension.
  105. cffi
  106. Only attempt to import the CFFI implementation.
  107. In addition, the ``zstandard`` module exports a ``backend`` attribute
  108. containing the string name of the backend being used. It will be one
  109. of ``cext`` or ``cffi`` (for *C extension* and *cffi*, respectively).
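The import rules above can be summarized as a small decision table. The
sketch below only models the documented behavior; it is not the library's
own import code. The function name ``backends_to_try`` is hypothetical, and
it assumes that an unset variable behaves like ``default``.

```python
import os
import platform


def backends_to_try(policy=None, implementation=None):
    """Return the backends that would be attempted, in order.

    Models the documented import policy only; not the library's own code.
    """
    if policy is None:
        # Assumption: an unset variable means the "default" policy.
        policy = os.environ.get("PYTHON_ZSTANDARD_IMPORT_POLICY", "default")
    if implementation is None:
        implementation = platform.python_implementation()

    if policy == "default":
        if implementation == "CPython":
            return ["cext"]
        if implementation == "PyPy":
            return ["cffi"]
        return ["cext", "cffi"]
    if policy == "cffi_fallback":
        return ["cext", "cffi"]
    if policy in ("cext", "cffi"):
        return [policy]
    raise ValueError("unknown import policy: %r" % (policy,))
```

Because the policy is read when the module is imported, the environment
variable has to be set before ``import zstandard`` runs.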
  110. The types, functions, and attributes exposed by the ``zstandard`` module
  111. are documented in the sections below.
  112. .. note::
  113. The documentation in this section makes references to various zstd
  114. concepts and functionality. The source repository contains a
  115. ``docs/concepts.rst`` file explaining these in more detail.
  116. ZstdCompressor
  117. --------------
  118. The ``ZstdCompressor`` class provides an interface for performing
  119. compression operations. Each instance is essentially a wrapper around a
  120. ``ZSTD_CCtx`` from the C API.
  121. Each instance is associated with parameters that control compression
  122. behavior. These come from the following named arguments (all optional):
  123. level
  124. Integer compression level. Valid values are between 1 and 22.
  125. dict_data
  126. Compression dictionary to use.
  127. Note: When using dictionary data and ``compress()`` is called multiple
  128. times, the ``ZstdCompressionParameters`` derived from an integer
  129. compression ``level`` and the first compressed data's size will be reused
  130. for all subsequent operations. This may not be desirable if source data
  131. size varies significantly.
  132. compression_params
  133. A ``ZstdCompressionParameters`` instance defining compression settings.
  134. write_checksum
  135. Whether a 4 byte checksum should be written with the compressed data.
  136. Defaults to False. If True, the decompressor can verify that decompressed
  137. data matches the original input data.
  138. write_content_size
  139. Whether the size of the uncompressed data will be written into the
  140. header of compressed data. Defaults to True. The data will only be
  141. written if the compressor knows the size of the input data. This is
  142. often not true for streaming compression.
  143. write_dict_id
  144. Whether to write the dictionary ID into the compressed data.
  145. Defaults to True. The dictionary ID is only written if a dictionary
  146. is being used.
  147. threads
  148. Enables and sets the number of threads to use for multi-threaded compression
  149. operations. Defaults to 0, which means to use single-threaded compression.
  150. Negative values will resolve to the number of logical CPUs in the system.
  151. Read below for more info on multi-threaded compression. This argument only
  152. controls thread count for operations that operate on individual pieces of
  153. data. APIs that spawn multiple threads for working on multiple pieces of
  154. data have their own ``threads`` argument.
  155. ``compression_params`` is mutually exclusive with ``level``, ``write_checksum``,
  156. ``write_content_size``, ``write_dict_id``, and ``threads``.
  157. Unless specified otherwise, assume that no two methods of ``ZstdCompressor``
  158. instances can be called from multiple Python threads simultaneously. In other
  159. words, assume instances are not thread safe unless stated otherwise.
  160. Utility Methods
  161. ^^^^^^^^^^^^^^^
  162. ``frame_progression()`` returns a 3-tuple containing the number of bytes
  163. ingested, consumed, and produced by the current compression operation.
  164. ``memory_size()`` obtains the memory utilization of the underlying zstd
  165. compression context, in bytes.::
  166. cctx = zstd.ZstdCompressor()
  167. memory = cctx.memory_size()
  168. Simple API
  169. ^^^^^^^^^^
  170. ``compress(data)`` compresses and returns data as a one-shot operation.::
  171. cctx = zstd.ZstdCompressor()
  172. compressed = cctx.compress(b'data to compress')
  173. The ``data`` argument can be any object that implements the *buffer protocol*.
  174. Stream Reader API
  175. ^^^^^^^^^^^^^^^^^
  176. ``stream_reader(source)`` can be used to obtain an object conforming to the
  177. ``io.RawIOBase`` interface for reading compressed output as a stream::
  178. with open(path, 'rb') as fh:
  179.     cctx = zstd.ZstdCompressor()
  180.     reader = cctx.stream_reader(fh)
  181.     while True:
  182.         chunk = reader.read(16384)
  183.         if not chunk:
  184.             break
  185.         # Do something with compressed chunk.
  186. Instances can also be used as context managers::
  187. with open(path, 'rb') as fh:
  188.     with cctx.stream_reader(fh) as reader:
  189.         while True:
  190.             chunk = reader.read(16384)
  191.             if not chunk:
  192.                 break
  193.             # Do something with compressed chunk.
  194. When the context manager exits or ``close()`` is called, the stream is closed,
  195. underlying resources are released, and future operations against the compression
  196. stream will fail.
  197. The ``source`` argument to ``stream_reader()`` can be any object with a
  198. ``read(size)`` method or any object implementing the *buffer protocol*.
  199. ``stream_reader()`` accepts a ``size`` argument specifying how large the input
  200. stream is. This is used to adjust compression parameters so they are
  201. tailored to the source size.::
  202. with open(path, 'rb') as fh:
  203.     cctx = zstd.ZstdCompressor()
  204.     with cctx.stream_reader(fh, size=os.stat(path).st_size) as reader:
  205.         ...
  206. If the ``source`` is a stream, you can specify how large ``read()`` requests
  207. to that stream should be via the ``read_size`` argument. It defaults to
  208. ``zstandard.COMPRESSION_RECOMMENDED_INPUT_SIZE``.::
  209. with open(path, 'rb') as fh:
  210.     cctx = zstd.ZstdCompressor()
  211.     # Will perform fh.read(8192) when obtaining data to feed into the
  212.     # compressor.
  213.     with cctx.stream_reader(fh, read_size=8192) as reader:
  214.         ...
  215. The stream returned by ``stream_reader()`` is neither writable nor seekable
  216. (even if the underlying source is seekable). ``readline()`` and
  217. ``readlines()`` are not implemented because they don't make sense for
  218. compressed data. ``tell()`` returns the number of compressed bytes
  219. emitted so far.
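Putting the pieces above together, a minimal sketch of draining a stream
reader into a destination could look like this. The helper name
``write_compressed`` is hypothetical. It takes an already configured
``ZstdCompressor`` as ``cctx`` so that compression settings stay with the
caller, and it relies only on the ``stream_reader()``, ``read()``, and
context manager behavior described above.

```python
def write_compressed(cctx, source, dest, chunk_size=16384):
    """Read compressed chunks from cctx.stream_reader(source) into dest.

    Returns the number of compressed bytes written to dest.
    """
    total = 0
    with cctx.stream_reader(source) as reader:
        while True:
            chunk = reader.read(chunk_size)
            if not chunk:
                # An empty read signals the end of the compressed stream.
                break
            dest.write(chunk)
            total += len(chunk)
    return total
```

``source`` can be an open binary file handle and ``dest`` any object with a
``write()`` method, such as a second file handle or an ``io.BytesIO``.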
  220. Streaming Input API
  221. ^^^^^^^^^^^^^^^^^^^
  222. ``stream_writer(fh)`` allows you to *stream* data into a compressor.
  223. Returned instances implement the ``io.RawIOBase`` interface. Only methods
  224. that involve writing will do useful things.
  225. The argument to ``stream_writer()`` must have a ``write(data)`` method. As
  226. compressed data is available, ``write()`` will be called with the compressed
  227. data as its argument. Many common Python types implement ``write()``, including
  228. open file handles and ``io.BytesIO``.
  229. The ``write(data)`` method is used to feed data into the compressor.
  230. The ``flush([flush_mode=FLUSH_BLOCK])`` method can be called to evict whatever
  231. data remains within the compressor's internal state into the output object. This
  232. may result in 0 or more ``write()`` calls to the output object. This method
  233. accepts an optional ``flush_mode`` argument to control the flushing behavior.
  234. Its value can be any of the ``FLUSH_*`` constants.
  235. Both ``write()`` and ``flush()`` return the number of bytes written to the
  236. object's ``write()``. In many cases, small inputs do not accumulate enough
  237. data to cause a write and ``write()`` will return ``0``.
  238. Calling ``close()`` will mark the stream as closed and subsequent I/O
  239. operations will raise ``ValueError`` (per the documented behavior of
  240. ``io.RawIOBase``). ``close()`` will also call ``close()`` on the underlying
  241. stream if such a method exists.
  242. Typical usage is as follows::
  243. cctx = zstd.ZstdCompressor(level=10)
  244. compressor = cctx.stream_writer(fh)
  245. compressor.write(b'chunk 0\n')
  246. compressor.write(b'chunk 1\n')
  247. compressor.flush()
  248. # Receiver will be able to decode ``chunk 0\nchunk 1\n`` at this point.
  249. # Receiver is also expecting more data in the zstd *frame*.
  250. compressor.write(b'chunk 2\n')
  251. compressor.flush(zstd.FLUSH_FRAME)
  252. # Receiver will be able to decode ``chunk 0\nchunk 1\nchunk 2\n``.
  253. # Receiver is expecting no more data, as the zstd frame is closed.
  254. # Any future calls to ``write()`` at this point will construct a new
  255. # zstd frame.
  256. Instances can be used as context managers. Exiting the context manager is
  257. the equivalent of calling ``close()``, which is equivalent to calling
  258. ``flush(zstd.FLUSH_FRAME)``::
  259. cctx = zstd.ZstdCompressor(level=10)
  260. with cctx.stream_writer(fh) as compressor:
  261.     compressor.write(b'chunk 0')
  262.     compressor.write(b'chunk 1')
  263.     ...
  264. .. important::
  265. If ``flush(FLUSH_FRAME)`` is not called, emitted data doesn't constitute
  266. a full zstd *frame* and consumers of this data may complain about malformed
  267. input. It is recommended to use instances as a context manager to ensure
  268. *frames* are properly finished.
  269. If the size of the data being fed to this streaming compressor is known,
  270. you can declare it before compression begins::
  271. cctx = zstd.ZstdCompressor()
  272. with cctx.stream_writer(fh, size=data_len) as compressor:
  273.     compressor.write(chunk0)
  274.     compressor.write(chunk1)
  275.     ...
  276. Declaring the size of the source data allows compression parameters to
  277. be tuned. And if ``write_content_size`` is used, it also results in the
  278. content size being written into the frame header of the output data.
  279. The size of the chunks passed to the destination's ``write()`` can be specified::
  280. cctx = zstd.ZstdCompressor()
  281. with cctx.stream_writer(fh, write_size=32768) as compressor:
  282.     ...
  283. To see how much memory is being used by the streaming compressor::
  284. cctx = zstd.ZstdCompressor()
  285. with cctx.stream_writer(fh) as compressor:
  286.     ...
  287.     byte_size = compressor.memory_size()
  288. The total number of bytes written so far is exposed via ``tell()``::
  289. cctx = zstd.ZstdCompressor()
  290. with cctx.stream_writer(fh) as compressor:
  291.     ...
  292.     total_written = compressor.tell()
  293. ``stream_writer()`` accepts a ``write_return_read`` boolean argument to control
  294. the return value of ``write()``. When ``False`` (the default), ``write()`` returns
  295. the number of bytes that were written to the underlying object. When
  296. ``True``, ``write()`` returns the number of bytes read from the input that
  297. were subsequently written to the compressor. ``True`` is the *proper* behavior
  298. for ``write()`` as specified by the ``io.RawIOBase`` interface and will become
  299. the default value in a future release.
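Code that wants the ``io.RawIOBase`` style return value today can opt in
explicitly. The following is a minimal sketch: the helper name
``compress_chunks_to`` is hypothetical, and ``cctx`` is assumed to be a
``ZstdCompressor`` instance.

```python
def compress_chunks_to(cctx, fh, chunks):
    """Feed chunks into cctx.stream_writer(fh); return input bytes consumed."""
    consumed = 0
    with cctx.stream_writer(fh, write_return_read=True) as compressor:
        for chunk in chunks:
            # With write_return_read=True, write() reports how many input
            # bytes were consumed, not how many compressed bytes reached fh.
            consumed += compressor.write(chunk)
    # Leaving the with block finishes the zstd frame and, as described above,
    # also closes fh if it has a close() method.
    return consumed
```

Because exiting the context manager closes the underlying stream, a caller
that needs to inspect an in-memory destination afterwards may prefer calling
``flush(zstd.FLUSH_FRAME)`` directly instead of using ``with``.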
  300. Streaming Output API
  301. ^^^^^^^^^^^^^^^^^^^^
  302. ``read_to_iter(reader)`` provides a mechanism to stream data out of a
  303. compressor as an iterator of data chunks.::
  304. cctx = zstd.ZstdCompressor()
  305. for chunk in cctx.read_to_iter(fh):
  306.     # Do something with emitted data.
  307. ``read_to_iter()`` accepts an object that has a ``read(size)`` method or
  308. conforms to the buffer protocol.
  309. Uncompressed data is fetched from the source either by calling ``read(size)``
  310. or by fetching a slice of data from the object directly (in the case where
  311. the buffer protocol is being used). The returned iterator consists of chunks
  312. of compressed data.
  313. If reading from the source via ``read()``, ``read()`` will be called until
  314. it raises or returns an empty bytes (``b''``). It is perfectly valid for
  315. the source to deliver fewer bytes than were requested by ``read(size)``.
  316. Like ``stream_writer()``, ``read_to_iter()`` also accepts a ``size`` argument
  317. declaring the size of the input stream::
  318. cctx = zstd.ZstdCompressor()
  319. for chunk in cctx.read_to_iter(fh, size=some_int):
  320.     pass
  321. You can also control the size of each ``read()`` from the source and
  322. the ideal size of output chunks::
  323. cctx = zstd.ZstdCompressor()
  324. for chunk in cctx.read_to_iter(fh, read_size=16384, write_size=8192):
  325.     pass
  326. Unlike ``stream_writer()``, ``read_to_iter()`` does not give direct control
  327. over the sizes of chunks fed into the compressor. Instead, chunk sizes will
  328. be whatever the object being read from delivers. These will often be of a
  329. uniform size.
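Since ``read_to_iter()`` yields compressed chunks one at a time, the output
can be processed without holding all of it in memory. As one possible use,
the sketch below hashes the compressed stream. The helper name
``sha256_of_compressed`` is hypothetical, and ``cctx`` is assumed to be a
``ZstdCompressor`` instance.

```python
import hashlib


def sha256_of_compressed(cctx, source, read_size=16384):
    """Return the SHA-256 hex digest of the compressed form of source."""
    digest = hashlib.sha256()
    for chunk in cctx.read_to_iter(source, read_size=read_size):
        # Each chunk is a piece of compressed output; only one is held
        # in memory at a time.
        digest.update(chunk)
    return digest.hexdigest()
```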
  330. Stream Copying API
  331. ^^^^^^^^^^^^^^^^^^
  332. ``copy_stream(ifh, ofh)`` can be used to copy data between 2 streams while
  333. compressing it.::
  334. cctx = zstd.ZstdCompressor()
  335. cctx.copy_stream(ifh, ofh)
  336. For example, say you wish to compress a file::
  337. cctx = zstd.ZstdCompressor()
  338. with open(input_path, 'rb') as ifh, open(output_path, 'wb') as ofh:
  339.     cctx.copy_stream(ifh, ofh)
  340. It is also possible to declare the size of the source stream::
  341. cctx = zstd.ZstdCompressor()
  342. cctx.copy_stream(ifh, ofh, size=len_of_input)
  343. You can also specify how large the chunks that are ``read()`` from and
  344. ``write()`` to the streams should be::
  345. cctx = zstd.ZstdCompressor()
  346. cctx.copy_stream(ifh, ofh, read_size=32768, write_size=16384)
  347. The stream copier returns a 2-tuple of bytes read and written::
  348. cctx = zstd.ZstdCompressor()
  349. read_count, write_count = cctx.copy_stream(ifh, ofh)
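The returned counts make it easy to report how well a file compressed. A
minimal sketch, assuming ``cctx`` is a ``ZstdCompressor`` instance; the helper
name ``compress_file`` and the ratio calculation are illustrative additions,
not part of the library.

```python
def compress_file(cctx, input_path, output_path):
    """Compress input_path into output_path via cctx.copy_stream().

    Returns (bytes_read, bytes_written, written_to_read_ratio).
    """
    with open(input_path, "rb") as ifh, open(output_path, "wb") as ofh:
        read_count, write_count = cctx.copy_stream(ifh, ofh)
    # Avoid dividing by zero for an empty input file.
    ratio = write_count / float(read_count) if read_count else 0.0
    return read_count, write_count, ratio
```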
  350. Compressor API
  351. ^^^^^^^^^^^^^^
  352. ``compressobj()`` returns an object that exposes ``compress(data)`` and
  353. ``flush()`` methods. Each returns compressed data or an empty bytes.
  354. The purpose of ``compressobj()`` is to provide an API-compatible interface
  355. with ``zlib.compressobj``, ``bz2.BZ2Compressor``, etc. This allows callers to
  356. swap in different compressor objects while using the same API.
  357. ``flush()`` accepts an optional argument indicating how to end the stream.
  358. ``zstd.COMPRESSOBJ_FLUSH_FINISH`` (the default) ends the compression stream.
  359. Once this type of flush is performed, ``compress()`` and ``flush()`` can
  360. no longer be called. This type of flush **must** be called to end the
  361. compression context. If not called, returned data may be incomplete.
  362. A ``zstd.COMPRESSOBJ_FLUSH_BLOCK`` argument to ``flush()`` will flush a
  363. zstd block. Flushes of this type can be performed multiple times. The next
  364. call to ``compress()`` will begin a new zstd block.
  365. Here is how this API should be used::
  366. cctx = zstd.ZstdCompressor()
  367. cobj = cctx.compressobj()
  368. data = cobj.compress(b'raw input 0')
  369. data = cobj.compress(b'raw input 1')
  370. data = cobj.flush()
  371. Or to flush blocks::
  372. cctx = zstd.ZstdCompressor()
  373. cobj = cctx.compressobj()
  374. data = cobj.compress(b'chunk in first block')
  375. data = cobj.flush(zstd.COMPRESSOBJ_FLUSH_BLOCK)
  376. data = cobj.compress(b'chunk in second block')
  377. data = cobj.flush()
  378. For best performance results, keep input chunks under 256KB. This avoids
  379. extra allocations for a large output object.
  380. It is possible to declare the input size of the data that will be fed into
  381. the compressor::
  382. cctx = zstd.ZstdCompressor()
  383. cobj = cctx.compressobj(size=6)
  384. data = cobj.compress(b'foobar')
  385. data = cobj.flush()
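The practical benefit of the shared interface is that helper code can be
written once against ``compress()`` and ``flush()``. The sketch below
exercises such a helper with ``zlib.compressobj()`` from the standard
library; the helper name ``compress_all`` is hypothetical. Based on the
description above, the object returned by ``ZstdCompressor().compressobj()``
should be usable in the same position.

```python
import zlib


def compress_all(cobj, chunks):
    """Feed chunks through a compressobj-style object and join the output."""
    parts = [cobj.compress(chunk) for chunk in chunks]
    # The final flush ends the stream; without it the output is incomplete.
    parts.append(cobj.flush())
    return b"".join(parts)


# zlib's compressor satisfies the same compress()/flush() interface.
packed = compress_all(zlib.compressobj(), [b"raw input 0", b"raw input 1"])
```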
  386. Chunker API
  387. ^^^^^^^^^^^
  388. ``chunker(size=None, chunk_size=COMPRESSION_RECOMMENDED_OUTPUT_SIZE)`` returns
  389. an object that can be used to iteratively feed chunks of data into a compressor
  390. and produce output chunks of a uniform size.
  391. The object returned by ``chunker()`` exposes the following methods:
  392. ``compress(data)``
  393. Feeds new input data into the compressor.
  394. ``flush()``
  395. Flushes all data currently in the compressor.
  396. ``finish()``
  397. Signals the end of input data. No new data can be compressed after this
  398. method is called.
  399. ``compress()``, ``flush()``, and ``finish()`` all return an iterator of
  400. ``bytes`` instances holding compressed data. The iterator may be empty. Callers
  401. MUST iterate through all elements of the returned iterator before performing
  402. another operation on the object.
  403. All chunks emitted by ``compress()`` will have a length of ``chunk_size``.
  404. ``flush()`` and ``finish()`` may return a final chunk smaller than
  405. ``chunk_size``.
  406. Here is how the API should be used::
  407. cctx = zstd.ZstdCompressor()
  408. chunker = cctx.chunker(chunk_size=32768)
  409. with open(path, 'rb') as fh:
  410.     while True:
  411.         in_chunk = fh.read(32768)
  412.         if not in_chunk:
  413.             break
  414.         for out_chunk in chunker.compress(in_chunk):
  415.             # Do something with output chunk of size 32768.
  416.     for out_chunk in chunker.finish():
  417.         # Do something with output chunks that finalize the zstd frame.
  418. The ``chunker()`` API is often a better alternative to ``compressobj()``.
  419. ``compressobj()`` will emit output data as it is available. This results in a
  420. *stream* of output chunks of varying sizes. The consistency of the output chunk
  421. size with ``chunker()`` is more appropriate for many usages, such as sending
  422. compressed data to a socket.
  423. ``compressobj()`` may also perform extra memory reallocations in order to
  424. dynamically adjust the sizes of the output chunks. Since ``chunker()`` output
  425. chunks are all the same size (except for flushed or final chunks), there is
  426. less memory allocation overhead.
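The usage pattern shown earlier in this section can be wrapped in a generator
so that callers see a single stream of output chunks. This is a minimal
sketch: the helper name ``iter_chunker_output`` is hypothetical, and
``chunker`` is assumed to be the object returned by
``ZstdCompressor().chunker()``.

```python
def iter_chunker_output(chunker, source, read_size=32768):
    """Yield compressed chunks for everything read from source."""
    while True:
        in_chunk = source.read(read_size)
        if not in_chunk:
            break
        # Each returned iterator is drained fully before the chunker is
        # called again, as the API requires.
        for out_chunk in chunker.compress(in_chunk):
            yield out_chunk
    # finish() ends the frame; its last chunk may be shorter than the rest.
    for out_chunk in chunker.finish():
        yield out_chunk
```

Every chunk except possibly the last has the configured ``chunk_size``, which
suits fixed-size transports such as writing to a socket.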
  427. Batch Compression API
  428. ^^^^^^^^^^^^^^^^^^^^^
  429. (Experimental. Not yet supported in CFFI bindings.)
  430. ``multi_compress_to_buffer(data, [threads=0])`` performs compression of multiple
  431. inputs as a single operation.
  432. Data to be compressed can be passed as a ``BufferWithSegmentsCollection``, a
  433. ``BufferWithSegments``, or a list containing byte like objects. Each element of
  434. the container will be compressed individually using the configured parameters
  435. on the ``ZstdCompressor`` instance.
  436. The ``threads`` argument controls how many threads to use for compression. The
  437. default is ``0`` which means to use a single thread. Negative values use the
  438. number of logical CPUs in the machine.
  439. The function returns a ``BufferWithSegmentsCollection``. This type represents
  440. N discrete memory allocations, each holding 1 or more compressed frames.
  441. Output data is written to shared memory buffers. This means that unlike
  442. regular Python objects, a reference to *any* object within the collection
  443. keeps the shared buffer and therefore memory backing it alive. This can have
  444. undesirable effects on process memory usage.
  445. The API and behavior of this function are experimental and will likely change.
  446. Known deficiencies include:
  447. * If asked to use multiple threads, it will always spawn that many threads,
  448. even if the input is too small to use them. It should automatically lower
  449. the thread count when the extra threads would just add overhead.
  450. * The buffer allocation strategy is fixed. There is room to make it dynamic,
  451. perhaps even to allow one output buffer per input, facilitating a variation
  452. of the API to return a list without the adverse effects of shared memory
  453. buffers.
  454. ZstdDecompressor
  455. ----------------
  456. The ``ZstdDecompressor`` class provides an interface for performing
  457. decompression. It is effectively a wrapper around the ``ZSTD_DCtx`` type from
  458. the C API.
  459. Each instance is associated with parameters that control decompression. These
  460. come from the following named arguments (all optional):
  461. dict_data
  462. Compression dictionary to use.
  463. max_window_size
  464. Sets an upper limit on the window size for decompression operations in
  465. kibibytes. This setting can be used to prevent large memory allocations
  466. for inputs using large compression windows.
  467. format
  468. Set the format of data for the decoder. By default, this is
  469. ``zstd.FORMAT_ZSTD1``. It can be set to ``zstd.FORMAT_ZSTD1_MAGICLESS`` to
  470. allow decoding frames without the 4 byte magic header. Not all decompression
  471. APIs support this mode.
  472. The interface of this class is very similar to ``ZstdCompressor`` (by design).
  473. Unless specified otherwise, assume that no two methods of ``ZstdDecompressor``
  474. instances can be called from multiple Python threads simultaneously. In other
  475. words, assume instances are not thread safe unless stated otherwise.
  476. Utility Methods
  477. ^^^^^^^^^^^^^^^
  478. ``memory_size()`` obtains the size of the underlying zstd decompression context,
  479. in bytes.::
  480. dctx = zstd.ZstdDecompressor()
  481. size = dctx.memory_size()
  482. Simple API
  483. ^^^^^^^^^^
  484. ``decompress(data)`` can be used to decompress an entire compressed zstd
  485. frame in a single operation.::
  486. dctx = zstd.ZstdDecompressor()
  487. decompressed = dctx.decompress(data)
  488. By default, ``decompress(data)`` will only work on data written with the content
  489. size encoded in its header (this is the default behavior of
  490. ``ZstdCompressor().compress()`` but may not be true for streaming compression). If
  491. compressed data without an embedded content size is seen, ``zstd.ZstdError`` will
  492. be raised.
  493. If the compressed data doesn't have its content size embedded within it,
  494. decompression can be attempted by specifying the ``max_output_size``
  495. argument.::
  496. dctx = zstd.ZstdDecompressor()
  497. uncompressed = dctx.decompress(data, max_output_size=1048576)
  498. Ideally, ``max_output_size`` will be identical to the decompressed output
  499. size.
  500. If ``max_output_size`` is too small to hold the decompressed data,
  501. ``zstd.ZstdError`` will be raised.
  502. If ``max_output_size`` is larger than the decompressed data, the allocated
  503. output buffer will be resized to only use the space required.
  504. Please note that an allocation of the requested ``max_output_size`` will be
  505. performed every time the method is called. Setting it to a very large value could
  506. result in a lot of work for the memory allocator and may result in
  507. ``MemoryError`` being raised if the allocation fails.
  508. .. important::
  509. If the exact size of decompressed data is unknown (not passed in explicitly
  510. and not stored in the zstandard frame), for performance reasons it is
  511. encouraged to use a streaming API.
  512. Stream Reader API
  513. ^^^^^^^^^^^^^^^^^
  514. ``stream_reader(source)`` can be used to obtain an object conforming to the
  515. ``io.RawIOBase`` interface for reading decompressed output as a stream::
  516. with open(path, 'rb') as fh:
  517. dctx = zstd.ZstdDecompressor()
  518. reader = dctx.stream_reader(fh)
  519. while True:
  520. chunk = reader.read(16384)
  521. if not chunk:
  522. break
  523. # Do something with decompressed chunk.
  524. The stream can also be used as a context manager::
  525. with open(path, 'rb') as fh:
  526. dctx = zstd.ZstdDecompressor()
  527. with dctx.stream_reader(fh) as reader:
  528. ...
  529. When used as a context manager, the stream is closed and the underlying
  530. resources are released when the context manager exits. Future operations against
  531. the stream will fail.
  532. The ``source`` argument to ``stream_reader()`` can be any object with a
  533. ``read(size)`` method or any object implementing the *buffer protocol*.
  534. If the ``source`` is a stream, you can specify how large ``read()`` requests
  535. to that stream should be via the ``read_size`` argument. It defaults to
  536. ``zstandard.DECOMPRESSION_RECOMMENDED_INPUT_SIZE``.::
  537. with open(path, 'rb') as fh:
  538. dctx = zstd.ZstdDecompressor()
  539. # Will perform fh.read(8192) when obtaining data for the decompressor.
  540. with dctx.stream_reader(fh, read_size=8192) as reader:
  541. ...
  542. The stream returned by ``stream_reader()`` is not writable.
  543. The stream returned by ``stream_reader()`` is *partially* seekable.
  544. Absolute and relative positions (``SEEK_SET`` and ``SEEK_CUR``) forward
  545. of the current position are allowed. Offsets behind the current read
  546. position and offsets relative to the end of stream are not allowed and
  547. will raise ``ValueError`` if attempted.
  548. ``tell()`` returns the number of decompressed bytes read so far.
  549. Not all I/O methods are implemented. Notably missing is support for
  550. ``readline()``, ``readlines()``, and linewise iteration support. This is
  551. because streams operate on binary data - not text data. If you want to
  552. convert decompressed output to text, you can chain an ``io.TextIOWrapper``
  553. to the stream::
  554. with open(path, 'rb') as fh:
  555. dctx = zstd.ZstdDecompressor()
  556. stream_reader = dctx.stream_reader(fh)
  557. text_stream = io.TextIOWrapper(stream_reader, encoding='utf-8')
  558. for line in text_stream:
  559. ...
  560. The ``read_across_frames`` argument to ``stream_reader()`` controls the
  561. behavior of read operations when the end of a zstd *frame* is encountered.
  562. When ``False`` (the default), a read will complete when the end of a
  563. zstd *frame* is encountered. When ``True``, a read can potentially
  564. return data spanning multiple zstd *frames*.
  565. Streaming Input API
  566. ^^^^^^^^^^^^^^^^^^^
  567. ``stream_writer(fh)`` allows you to *stream* data into a decompressor.
  568. Returned instances implement the ``io.RawIOBase`` interface. Only methods
  569. that involve writing will do useful things.
  570. The argument to ``stream_writer()`` is typically an object that also implements
  571. ``io.RawIOBase``. But any object with a ``write(data)`` method will work. Many
  572. common Python types conform to this interface, including open file handles
  573. and ``io.BytesIO``.
  574. Behavior is similar to ``ZstdCompressor.stream_writer()``: compressed data
  575. is sent to the decompressor by calling ``write(data)`` and decompressed
  576. output is written to the underlying stream by calling its ``write(data)``
  577. method.::
  578. dctx = zstd.ZstdDecompressor()
  579. decompressor = dctx.stream_writer(fh)
  580. decompressor.write(compressed_data)
  581. ...
  582. Calls to ``write()`` will return the number of bytes written to the output
  583. object. Not all inputs will result in bytes being written, so return values
  584. of ``0`` are possible.
  585. Like the ``stream_writer()`` compressor, instances can be used as context
  586. managers. However, the context manager adds no special behavior and offers
  587. little to no benefit.
  588. Calling ``close()`` will mark the stream as closed and subsequent I/O operations
  589. will raise ``ValueError`` (per the documented behavior of ``io.RawIOBase``).
  590. ``close()`` will also call ``close()`` on the underlying stream if such a
  591. method exists.
  592. The size of the chunks passed to the destination's ``write()`` can be specified::
  593. dctx = zstd.ZstdDecompressor()
  594. with dctx.stream_writer(fh, write_size=16384) as decompressor:
  595. pass
  596. You can see how much memory is being used by the decompressor::
  597. dctx = zstd.ZstdDecompressor()
  598. with dctx.stream_writer(fh) as decompressor:
  599. byte_size = decompressor.memory_size()
  600. ``stream_writer()`` accepts a ``write_return_read`` boolean argument to control
  601. the return value of ``write()``. When ``False`` (the default), ``write()``
  602. returns the number of bytes that were written to the underlying stream.
  603. When ``True``, ``write()`` returns the number of bytes read from the input.
  604. ``True`` is the *proper* behavior for ``write()`` as specified by the
  605. ``io.RawIOBase`` interface and will become the default in a future release.
  606. Streaming Output API
  607. ^^^^^^^^^^^^^^^^^^^^
  608. ``read_to_iter(fh)`` provides a mechanism to stream decompressed data out of a
  609. compressed source as an iterator of data chunks.::
  610. dctx = zstd.ZstdDecompressor()
  611. for chunk in dctx.read_to_iter(fh):
  612. # Do something with original data.
  613. ``read_to_iter()`` accepts an object with a ``read(size)`` method that will
  614. return compressed bytes or an object conforming to the buffer protocol that
  615. can expose its data as a contiguous range of bytes.
  616. ``read_to_iter()`` returns an iterator whose elements are chunks of the
  617. decompressed data.
  618. The size of each ``read()`` request to the source can be specified::
  619. dctx = zstd.ZstdDecompressor()
  620. for chunk in dctx.read_to_iter(fh, read_size=16384):
  621. pass
  622. It is also possible to skip leading bytes in the input data::
  623. dctx = zstd.ZstdDecompressor()
  624. for chunk in dctx.read_to_iter(fh, skip_bytes=1):
  625. pass
  626. .. tip::
  627. Skipping leading bytes is useful if the source data contains extra
  628. *header* data. Traditionally, you would need to create a slice or
  629. ``memoryview`` of the data you want to decompress. This would create
  630. overhead. It is more efficient to pass the offset into this API.
  631. Similarly to ``ZstdCompressor.read_to_iter()``, the consumer of the iterator
  632. controls when data is decompressed. If the iterator isn't consumed,
  633. decompression is put on hold.
  634. When ``read_to_iter()`` is passed an object conforming to the buffer protocol,
  635. the behavior may seem similar to what occurs when the simple decompression
  636. API is used. However, this API works when the decompressed size is unknown.
  637. Furthermore, if feeding large inputs, the decompressor will work in chunks
  638. instead of performing a single operation.
  639. Stream Copying API
  640. ^^^^^^^^^^^^^^^^^^
  641. ``copy_stream(ifh, ofh)`` can be used to copy data between two streams while
  642. performing decompression.::
  643. dctx = zstd.ZstdDecompressor()
  644. dctx.copy_stream(ifh, ofh)
  645. e.g. to decompress a file to another file::
  646. dctx = zstd.ZstdDecompressor()
  647. with open(input_path, 'rb') as ifh, open(output_path, 'wb') as ofh:
  648. dctx.copy_stream(ifh, ofh)
  649. The size of chunks being ``read()`` and ``write()`` from and to the streams
  650. can be specified::
  651. dctx = zstd.ZstdDecompressor()
  652. dctx.copy_stream(ifh, ofh, read_size=8192, write_size=16384)
  653. Decompressor API
  654. ^^^^^^^^^^^^^^^^
  655. ``decompressobj()`` returns an object that exposes a ``decompress(data)``
  656. method. Compressed data chunks are fed into ``decompress(data)`` and
  657. uncompressed output (or an empty ``bytes`` object) is returned. Output from subsequent
  658. calls needs to be concatenated to reassemble the full decompressed byte
  659. sequence.
  660. The purpose of ``decompressobj()`` is to provide an API-compatible interface
  661. with ``zlib.decompressobj`` and ``bz2.BZ2Decompressor``. This allows callers
  662. to swap in different decompressor objects while using the same API.
  663. Each object is single use: once an input frame is decoded, ``decompress()``
  664. can no longer be called.
  665. Here is how this API should be used::
  666. dctx = zstd.ZstdDecompressor()
  667. dobj = dctx.decompressobj()
  668. data = dobj.decompress(compressed_chunk_0)
  669. data += dobj.decompress(compressed_chunk_1)
  670. By default, calls to ``decompress()`` write output data in chunks of size
  671. ``DECOMPRESSION_RECOMMENDED_OUTPUT_SIZE``. These chunks are concatenated
  672. before being returned to the caller. It is possible to define the size of
  673. these temporary chunks by passing ``write_size`` to ``decompressobj()``::
  674. dctx = zstd.ZstdDecompressor()
  675. dobj = dctx.decompressobj(write_size=1048576)
  676. .. note::
  677. Because calls to ``decompress()`` may need to perform multiple
  678. memory (re)allocations, this streaming decompression API isn't as
  679. efficient as other APIs.
  680. For compatibility with the standard library APIs, instances expose a
  681. ``flush([length=None])`` method. This method no-ops and has no meaningful
  682. side-effects, making it safe to call any time.
  683. Batch Decompression API
  684. ^^^^^^^^^^^^^^^^^^^^^^^
  685. (Experimental. Not yet supported in CFFI bindings.)
  686. ``multi_decompress_to_buffer()`` performs decompression of multiple
  687. frames as a single operation and returns a ``BufferWithSegmentsCollection``
  688. containing decompressed data for all inputs.
  689. Compressed frames can be passed to the function as a ``BufferWithSegments``,
  690. a ``BufferWithSegmentsCollection``, or as a list containing objects that
  691. conform to the buffer protocol. For best performance, pass a
  692. ``BufferWithSegmentsCollection`` or a ``BufferWithSegments``, as
  693. minimal input validation will be done for those types. If calling from
  694. Python (as opposed to C), constructing one of these instances may add
  695. overhead that cancels out the time saved by skipping validation of list
  696. inputs.::
  697. dctx = zstd.ZstdDecompressor()
  698. results = dctx.multi_decompress_to_buffer([b'...', b'...'])
  699. The decompressed size of each frame MUST be discoverable. It can either be
  700. embedded within the zstd frame (``write_content_size=True`` argument to
  701. ``ZstdCompressor``) or passed in via the ``decompressed_sizes`` argument.
  702. The ``decompressed_sizes`` argument is an object conforming to the buffer
  703. protocol which holds an array of 64-bit unsigned integers in the machine's
  704. native format defining the decompressed sizes of each frame. If this argument
  705. is passed, it avoids having to scan each frame for its decompressed size.
  706. This frame scanning can add noticeable overhead in some scenarios.::
  707. frames = [...]
  708. sizes = struct.pack('=QQQQ', len0, len1, len2, len3)
  709. dctx = zstd.ZstdDecompressor()
  710. results = dctx.multi_decompress_to_buffer(frames, decompressed_sizes=sizes)
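When the number of frames is not fixed, the format string for ``struct.pack`` can be built from the list length. The helper below is a hypothetical convenience using only the standard library; it produces a buffer in the layout described above for ``decompressed_sizes``.

```python
import struct


def pack_decompressed_sizes(sizes):
    """Pack sizes as native-endian unsigned 64-bit integers, one per frame."""
    return struct.pack("=%dQ" % len(sizes), *sizes)


packed = pack_decompressed_sizes([10, 2048, 7])
```

Each size occupies 8 bytes, so the buffer length is always eight times the number of frames.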
  711. The ``threads`` argument controls the number of threads to use to perform
  712. decompression operations. The default (``0``) or the value ``1`` means to
  713. use a single thread. Negative values use the number of logical CPUs in the
  714. machine.
  715. .. note::
  716. It is possible to pass a ``mmap.mmap()`` instance into this function by
  717. wrapping it with a ``BufferWithSegments`` instance (which will define the
  718. offsets of frames within the memory mapped region).
  719. This function is logically equivalent to performing ``dctx.decompress()``
  720. on each input frame and returning the result.
  721. This function exists to perform decompression on multiple frames as fast
  722. as possible by having as little overhead as possible. Since decompression is
  723. performed as a single operation and since the decompressed output is stored in
  724. a single buffer, extra memory allocations, Python objects, and Python function
  725. calls are avoided. This is ideal for scenarios where callers know up front that
  726. they need to access data for multiple frames, such as when *delta chains* are
  727. being used.
  728. Currently, the implementation always spawns multiple threads when requested,
  729. even if the amount of work to do is small. In the future, it will be smarter
  730. about avoiding threads and their associated overhead when the amount of
  731. work to do is small.
  732. Prefix Dictionary Chain Decompression
  733. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  734. ``decompress_content_dict_chain(frames)`` performs decompression of a list of
  735. zstd frames produced using chained *prefix* dictionary compression. Such
  736. a list of frames is produced by compressing discrete inputs where each
  737. non-initial input is compressed with a *prefix* dictionary consisting of the
  738. content of the previous input.
  739. For example, say you have the following inputs::
  740. inputs = [b'input 1', b'input 2', b'input 3']
  741. The zstd frame chain consists of:
  742. 1. ``b'input 1'`` compressed in standalone/discrete mode
  743. 2. ``b'input 2'`` compressed using ``b'input 1'`` as a *prefix* dictionary
  744. 3. ``b'input 3'`` compressed using ``b'input 2'`` as a *prefix* dictionary
  745. Each zstd frame **must** have the content size written.
  746. The following Python code can be used to produce a *prefix dictionary chain*::
  747. def make_chain(inputs):
  748. frames = []
  749. # First frame is compressed in standalone/discrete mode.
  750. zctx = zstd.ZstdCompressor()
  751. frames.append(zctx.compress(inputs[0]))
  752. # Subsequent frames use the previous fulltext as a prefix dictionary
  753. for i, raw in enumerate(inputs[1:]):
  754. dict_data = zstd.ZstdCompressionDict(
  755. inputs[i], dict_type=zstd.DICT_TYPE_RAWCONTENT)
  756. zctx = zstd.ZstdCompressor(dict_data=dict_data)
  757. frames.append(zctx.compress(raw))
  758. return frames
  759. ``decompress_content_dict_chain()`` returns the uncompressed data of the last
  760. element in the input chain.
  761. .. note::
  762. It is possible to implement *prefix dictionary chain* decompression
  763. on top of other APIs. However, this function will likely be faster -
  764. especially for long input chains - as it avoids the overhead of instantiating
  765. and passing around intermediate objects between C and Python.
  766. Multi-Threaded Compression
  767. --------------------------
  768. ``ZstdCompressor`` accepts a ``threads`` argument that controls the number
  769. of threads to use for compression. The way this works is that input is split
  770. into segments and each segment is fed into a worker pool for compression. Once
  771. a segment is compressed, it is flushed/appended to the output.
  772. .. note::
  773. These threads are created at the C layer and are not Python threads. So they
  774. work outside the GIL. It is therefore possible to CPU saturate multiple cores
  775. from Python.
  776. The segment size for multi-threaded compression is chosen from the window size
  777. of the compressor. This is derived from the ``window_log`` attribute of a
  778. ``ZstdCompressionParameters`` instance. By default, segment sizes are in the 1+MB
  779. range.
  780. If multi-threaded compression is requested and the input is smaller than the
  781. configured segment size, only a single compression thread will be used. If the
  782. input is smaller than the segment size multiplied by the thread pool size or
  783. if data cannot be delivered to the compressor fast enough, not all requested
  784. compressor threads may be active simultaneously.
  785. Compared to non-multi-threaded compression, multi-threaded compression has
  786. higher per-operation overhead. This includes extra memory operations,
  787. thread creation, lock acquisition, etc.
  788. Due to the nature of multi-threaded compression using *N* compression
  789. *states*, the output from multi-threaded compression will likely be larger
  790. than non-multi-threaded compression. The difference is usually small. But
  791. there is a CPU/wall time versus size trade off that may warrant investigation.
  792. Output from multi-threaded compression does not require any special handling
  793. on the decompression side. To the decompressor, data generated by a
  794. single-threaded compressor looks the same as data generated by a
  795. multi-threaded compressor, and decoding it has no additional resource
  796. requirements.
  797. Dictionary Creation and Management
  798. ----------------------------------
  799. Compression dictionaries are represented with the ``ZstdCompressionDict`` type.
  800. Instances can be constructed from bytes::
  801. dict_data = zstd.ZstdCompressionDict(data)
  802. It is possible to construct a dictionary from *any* data. If the data doesn't
  803. begin with a magic header, it will be treated as a *prefix* dictionary.
  804. *Prefix* dictionaries allow compression operations to reference raw data
  805. within the dictionary.
  806. It is possible to force the use of *prefix* dictionaries or to require a
  807. dictionary header::
  808. dict_data = zstd.ZstdCompressionDict(data,
  809. dict_type=zstd.DICT_TYPE_RAWCONTENT)
  810. dict_data = zstd.ZstdCompressionDict(data,
  811. dict_type=zstd.DICT_TYPE_FULLDICT)
  812. You can see how many bytes are in the dictionary by calling ``len()``::
  813. dict_data = zstd.train_dictionary(size, samples)
  814. dict_size = len(dict_data) # will not be larger than ``size``
  815. Once you have a dictionary, you can pass it to the objects performing
  816. compression and decompression::
  817. dict_data = zstd.train_dictionary(131072, samples)
  818. cctx = zstd.ZstdCompressor(dict_data=dict_data)
  819. for source_data in input_data:
  820. compressed = cctx.compress(source_data)
  821. # Do something with compressed data.
  822. dctx = zstd.ZstdDecompressor(dict_data=dict_data)
  823. for compressed_data in input_data:
  824. buffer = io.BytesIO()
  825. with dctx.stream_writer(buffer) as decompressor:
  826. decompressor.write(compressed_data)
  827. # Do something with raw data in ``buffer``.
  828. Dictionaries have unique integer IDs. You can retrieve this ID via::
  829. dict_id = dict_data.dict_id()
  830. You can obtain the raw data in the dict (useful for persisting and constructing
  831. a ``ZstdCompressionDict`` later) via ``as_bytes()``::
  832. dict_data = zstd.train_dictionary(size, samples)
  833. raw_data = dict_data.as_bytes()
  834. By default, when a ``ZstdCompressionDict`` is *attached* to a
  835. ``ZstdCompressor``, each ``ZstdCompressor`` performs work to prepare the
  836. dictionary for use. This is fine if only 1 compression operation is being
  837. performed or if the ``ZstdCompressor`` is being reused for multiple operations.
  838. But if multiple ``ZstdCompressor`` instances are being used with the dictionary,
  839. this can add overhead.
  840. It is possible to *precompute* the dictionary so it can readily be consumed
  841. by multiple ``ZstdCompressor`` instances::
  842. d = zstd.ZstdCompressionDict(data)
  843. # Precompute for compression level 3.
  844. d.precompute_compress(level=3)
  845. # Precompute with specific compression parameters.
  846. params = zstd.ZstdCompressionParameters(...)
  847. d.precompute_compress(compression_params=params)
  848. .. note::
  849. When a dictionary is precomputed, the compression parameters used to
  850. precompute the dictionary overwrite some of the compression parameters
  851. specified to ``ZstdCompressor.__init__``.
  852. Training Dictionaries
  853. ^^^^^^^^^^^^^^^^^^^^^
  854. Unless using *prefix* dictionaries, dictionary data is produced by *training*
  855. on existing data::
  856. dict_data = zstd.train_dictionary(size, samples)
  857. This takes a target dictionary size and list of bytes instances and creates and
  858. returns a ``ZstdCompressionDict``.
  859. The dictionary training mechanism is known as *cover*. More details about it are
  860. available in the paper *Effective Construction of Relative Lempel-Ziv
  861. Dictionaries* (authors: Liao, Petri, Moffat, Wirth).
  862. The cover algorithm takes parameters ``k`` and ``d``. These are the
  863. *segment size* and *dmer size*, respectively. The returned dictionary
  864. instance created by this function has ``k`` and ``d`` attributes
  865. containing the values for these parameters. If a ``ZstdCompressionDict``
  866. is constructed from raw bytes data (a content-only dictionary), the
  867. ``k`` and ``d`` attributes will be ``0``.
  868. The segment and dmer size parameters to the cover algorithm can either be
  869. specified manually or ``train_dictionary()`` can try multiple values
  870. and pick the best one, where *best* means the smallest compressed data size.
  871. This latter mode is called *optimization* mode.
  872. If none of ``k``, ``d``, ``steps``, ``threads``, ``level``, ``notifications``,
  873. or ``dict_id`` (basically anything from the underlying ``ZDICT_cover_params_t``
  874. struct) are defined, *optimization* mode is used with default parameter
  875. values.
  876. If ``steps`` or ``threads`` are defined, then *optimization* mode is engaged
  877. with explicit control over those parameters. Specifying ``threads=0`` or
  878. ``threads=1`` can be used to engage *optimization* mode if other parameters
  879. are not defined.
  880. Otherwise, non-*optimization* mode is used with the parameters specified.
  881. This function takes the following arguments:
  882. dict_size
  883. Target size in bytes of the dictionary to generate.
  884. samples
  885. A list of bytes holding samples the dictionary will be trained from.
  886. k
  887. Parameter to cover algorithm defining the segment size. A reasonable range
  888. is [16, 2048+].
  889. d
  890. Parameter to cover algorithm defining the dmer size. A reasonable range is
  891. [6, 16]. ``d`` must be less than or equal to ``k``.
  892. dict_id
  893. Integer dictionary ID for the produced dictionary. Default is 0, which uses
  894. a random value.
  895. steps
  896. Number of steps through ``k`` values to perform when trying parameter
  897. variations.
  898. threads
  899. Number of threads to use when trying parameter variations. Default is 0,
  900. which means to use a single thread. A negative value can be specified to
  901. use as many threads as there are detected logical CPUs.
  902. level
  903. Integer target compression level when trying parameter variations.
  904. notifications
  905. Controls writing of informational messages to ``stderr``. ``0`` (the
  906. default) means to write nothing. ``1`` writes errors. ``2`` writes
  907. progression info. ``3`` writes more details. And ``4`` writes all info.
  908. Explicit Compression Parameters
  909. -------------------------------
  910. Zstandard offers a high-level *compression level* that maps to lower-level
  911. compression parameters. For many consumers, this numeric level is the only
  912. compression setting you'll need to touch.
  913. But for advanced use cases, it might be desirable to tweak these lower-level
  914. settings.
  915. The ``ZstdCompressionParameters`` type represents these low-level compression
  916. settings.
  917. Instances of this type can be constructed from a myriad of keyword arguments
  918. (defined below) for complete low-level control over each adjustable
  919. compression setting.
  920. From a higher level, one can construct a ``ZstdCompressionParameters`` instance
  921. given a desired compression level and target input and dictionary size
  922. using ``ZstdCompressionParameters.from_level()``. e.g.::
  923. # Derive compression settings for compression level 7.
  924. params = zstd.ZstdCompressionParameters.from_level(7)
  925. # With an input size of 1MB
  926. params = zstd.ZstdCompressionParameters.from_level(7, source_size=1048576)
  927. Using ``from_level()``, it is also possible to override individual compression
  928. parameters or to define additional settings that aren't automatically derived.
  929. e.g.::
  930. params = zstd.ZstdCompressionParameters.from_level(4, window_log=10)
  931. params = zstd.ZstdCompressionParameters.from_level(5, threads=4)
  932. Or you can define low-level compression settings directly::
  933. params = zstd.ZstdCompressionParameters(window_log=12, enable_ldm=True)
  934. Once a ``ZstdCompressionParameters`` instance is obtained, it can be used to
  935. configure a compressor::
  936. cctx = zstd.ZstdCompressor(compression_params=params)
  937. The named arguments and attributes of ``ZstdCompressionParameters`` are as
  938. follows:
  939. * format
  940. * compression_level
  941. * window_log
  942. * hash_log
  943. * chain_log
  944. * search_log
  945. * min_match
  946. * target_length
  947. * strategy
  948. * compression_strategy (deprecated: same as ``strategy``)
  949. * write_content_size
  950. * write_checksum
  951. * write_dict_id
  952. * job_size
  953. * overlap_log
  954. * overlap_size_log (deprecated: same as ``overlap_log``)
  955. * force_max_window
  956. * enable_ldm
  957. * ldm_hash_log
  958. * ldm_min_match
  959. * ldm_bucket_size_log
  960. * ldm_hash_rate_log
  961. * ldm_hash_every_log (deprecated: same as ``ldm_hash_rate_log``)
  962. * threads
  963. Some of these are very low-level settings. It may help to consult the official
  964. zstandard documentation for their behavior. Look for the ``ZSTD_c_*`` constants
  965. in ``zstd.h`` (https://github.com/facebook/zstd/blob/dev/lib/zstd.h).
  966. Frame Inspection
  967. ----------------
  968. Data emitted from zstd compression is encapsulated in a *frame*. This frame
  969. begins with a 4 byte *magic number* header followed by 2 to 14 bytes describing
  970. the frame in more detail. For more info, see
  971. https://github.com/facebook/zstd/blob/master/doc/zstd_compression_format.md.
  972. ``zstd.get_frame_parameters(data)`` parses a zstd *frame* header from a bytes
  973. instance and returns a ``FrameParameters`` object describing the frame.
  974. Depending on which fields are present in the frame and their values, the
  975. length of the frame parameters varies. If insufficient bytes are passed
  976. in to fully parse the frame parameters, ``ZstdError`` is raised. To ensure
  977. frame parameters can be parsed, pass in at least 18 bytes.
  978. ``FrameParameters`` instances have the following attributes:
  979. content_size
  980. Integer size of original, uncompressed content. This will be ``0`` if the
  981. original content size isn't written to the frame (controlled with the
  982. ``write_content_size`` argument to ``ZstdCompressor``) or if the input
  983. content size was ``0``.
  984. window_size
  985. Integer size of maximum back-reference distance in compressed data.
  986. dict_id
  987. Integer of dictionary ID used for compression. ``0`` if no dictionary
  988. ID was used or if the dictionary ID was ``0``.
  989. has_checksum
  990. Bool indicating whether a 4 byte content checksum is stored at the end
  991. of the frame.
  992. ``zstd.frame_header_size(data)`` returns the size of the zstandard frame
  993. header.
  994. ``zstd.frame_content_size(data)`` returns the content size as parsed from
  995. the frame header. ``-1`` means the content size is unknown. ``0`` means
  996. an empty frame. The content size is usually correct. However, it may not
  997. be accurate.
  998. Misc Functionality
  999. ------------------
  1000. estimate_decompression_context_size()
  1001. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  1002. Estimate the memory size requirements for a decompressor instance.
  1003. Constants
  1004. ---------
  1005. The following module constants/attributes are exposed:
  1006. ZSTD_VERSION
  1007. This module attribute exposes a 3-tuple of the Zstandard version. e.g.
  1008. ``(1, 0, 0)``
  1009. MAX_COMPRESSION_LEVEL
  1010. Integer max compression level accepted by compression functions
  1011. COMPRESSION_RECOMMENDED_INPUT_SIZE
  1012. Recommended chunk size to feed to compressor functions
  1013. COMPRESSION_RECOMMENDED_OUTPUT_SIZE
  1014. Recommended chunk size for compression output
  1015. DECOMPRESSION_RECOMMENDED_INPUT_SIZE
  1016. Recommended chunk size to feed into decompressor functions
  1017. DECOMPRESSION_RECOMMENDED_OUTPUT_SIZE
  1018. Recommended chunk size for decompression output
  1019. FRAME_HEADER
  1020. bytes containing header of the Zstandard frame
  1021. MAGIC_NUMBER
  1022. Frame header as an integer
  1023. FLUSH_BLOCK
  1024. Flushing behavior that denotes to flush a zstd block. A decompressor will
  1025. be able to decode all data fed into the compressor so far.
  1026. FLUSH_FRAME
  1027. Flushing behavior that denotes to end a zstd frame. Any new data fed
  1028. to the compressor will start a new frame.
  1029. CONTENTSIZE_UNKNOWN
  1030. Value for content size when the content size is unknown.
  1031. CONTENTSIZE_ERROR
  1032. Value for content size when content size couldn't be determined.
WINDOWLOG_MIN
    Minimum value for compression parameter
WINDOWLOG_MAX
    Maximum value for compression parameter
CHAINLOG_MIN
    Minimum value for compression parameter
CHAINLOG_MAX
    Maximum value for compression parameter
HASHLOG_MIN
    Minimum value for compression parameter
HASHLOG_MAX
    Maximum value for compression parameter
SEARCHLOG_MIN
    Minimum value for compression parameter
SEARCHLOG_MAX
    Maximum value for compression parameter
MINMATCH_MIN
    Minimum value for compression parameter
MINMATCH_MAX
    Maximum value for compression parameter
SEARCHLENGTH_MIN
    Minimum value for compression parameter

    Deprecated: use ``MINMATCH_MIN``
SEARCHLENGTH_MAX
    Maximum value for compression parameter

    Deprecated: use ``MINMATCH_MAX``
TARGETLENGTH_MIN
    Minimum value for compression parameter
STRATEGY_FAST
    Compression strategy
STRATEGY_DFAST
    Compression strategy
STRATEGY_GREEDY
    Compression strategy
STRATEGY_LAZY
    Compression strategy
STRATEGY_LAZY2
    Compression strategy
STRATEGY_BTLAZY2
    Compression strategy
STRATEGY_BTOPT
    Compression strategy
STRATEGY_BTULTRA
    Compression strategy
STRATEGY_BTULTRA2
    Compression strategy
FORMAT_ZSTD1
    Zstandard frame format
FORMAT_ZSTD1_MAGICLESS
    Zstandard frame format without magic header
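
To illustrate how ``FRAME_HEADER`` and ``MAGIC_NUMBER`` relate, here is a
minimal sketch that checks whether a buffer starts with the Zstandard magic
number. The two values are written out as literals (taken from the Zstandard
frame format) so the snippet runs without this module installed; in real code
you would read them from the module attributes instead. The helper name
``looks_like_zstd_frame`` is hypothetical.

```python
import struct

# Literal copies of the values from the Zstandard frame format. With this
# module installed, use zstandard.FRAME_HEADER and zstandard.MAGIC_NUMBER.
FRAME_HEADER = b"\x28\xb5\x2f\xfd"
MAGIC_NUMBER = 0xFD2FB528


def looks_like_zstd_frame(data):
    """Return True if data begins with the Zstandard magic number."""
    return bytes(data[:4]) == FRAME_HEADER


# The integer form is the little-endian reading of the 4 header bytes.
assert struct.unpack("<I", FRAME_HEADER)[0] == MAGIC_NUMBER
```

A check like this cannot recognize frames written with
``FORMAT_ZSTD1_MAGICLESS``, since those omit the magic header.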

Performance Considerations
--------------------------

The ``ZstdCompressor`` and ``ZstdDecompressor`` types maintain state to a
persistent compression or decompression *context*. Reusing a ``ZstdCompressor``
or ``ZstdDecompressor`` instance for multiple operations is faster than
instantiating a new ``ZstdCompressor`` or ``ZstdDecompressor`` for each
operation. The differences are magnified as the size of data decreases. For
example, the difference between *context* reuse and non-reuse for 100,000
inputs of 100 bytes each will be significant (possibly over 10x faster to
reuse contexts), whereas for 10 inputs of 100,000,000 bytes each the two
approaches will be more similar in speed (because the time spent doing
compression dwarfs the time spent creating new *contexts*).
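
The difference can be measured with a small harness like the sketch below.
It is deliberately generic: ``make_compressor`` is assumed to be any
zero-argument callable returning an object with a ``compress(data)`` method,
and ``time_compression`` is a hypothetical helper name, not part of this
module.

```python
import time


def time_compression(inputs, make_compressor, reuse):
    """Compress every input and return (elapsed_seconds, outputs)."""
    outputs = []
    start = time.perf_counter()
    if reuse:
        # One context is created up front and shared by all inputs.
        cctx = make_compressor()
        for data in inputs:
            outputs.append(cctx.compress(data))
    else:
        # A fresh context is created for every single input.
        for data in inputs:
            outputs.append(make_compressor().compress(data))
    return time.perf_counter() - start, outputs
```

With this module, ``make_compressor`` would be ``zstandard.ZstdCompressor``.
Running the helper once with ``reuse=True`` and once with ``reuse=False`` on
many small inputs should show the gap described above; actual numbers depend
on the machine and the data.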

Buffer Types
------------

The API exposes a handful of custom types for interfacing with memory buffers.
The primary goal of these types is to facilitate efficient multi-object
operations.

The essential idea is to have a single memory allocation provide backing
storage for multiple logical objects. This has 2 main advantages: fewer
allocations and optimal memory access patterns. This avoids having to allocate
a Python object for each logical object and furthermore ensures that access of
data for objects can be sequential (read: fast) in memory.

BufferWithSegments
^^^^^^^^^^^^^^^^^^

The ``BufferWithSegments`` type represents a memory buffer containing N
discrete items of known lengths (segments). It is essentially a fixed size
memory address and an array of 2-tuples of ``(offset, length)`` 64-bit
unsigned native endian integers defining the byte offset and length of each
segment within the buffer.

Instances behave like containers.

``len()`` returns the number of segments within the instance.

``o[index]`` or ``__getitem__`` obtains a ``BufferSegment`` representing an
individual segment within the backing buffer. That returned object references
(not copies) memory. This means that iterating all objects doesn't copy
data within the buffer.

The ``.size`` attribute contains the total size in bytes of the backing
buffer.

Instances conform to the buffer protocol. So a reference to the backing bytes
can be obtained via ``memoryview(o)``. A *copy* of the backing bytes can also
be obtained via ``.tobytes()``.

The ``.segments`` attribute exposes the array of ``(offset, length)`` for
segments within the buffer. It is a ``BufferSegments`` type.
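
The layout described above can be modeled in pure Python. The following is
only an illustration of the idea (one backing allocation plus a table of
native-endian unsigned 64-bit ``(offset, length)`` pairs), not this module's
implementation; ``get_segment`` is a hypothetical helper.

```python
import struct

# Three logical items stored back to back in a single allocation.
items = [b"foo", b"barbaz", b""]
backing = b"".join(items)

# Build the (offset, length) table: two native-endian unsigned 64-bit
# integers (16 bytes) per segment.
segments = bytearray()
offset = 0
for item in items:
    segments += struct.pack("=QQ", offset, len(item))
    offset += len(item)


def get_segment(view, table, index):
    """Return a zero-copy memoryview of segment `index`."""
    off, length = struct.unpack_from("=QQ", table, index * 16)
    return view[off:off + length]


view = memoryview(backing)
second = get_segment(view, segments, 1)  # references backing, no copy made
```

Slicing a ``memoryview`` does not copy, which mirrors how a
``BufferSegment`` references memory in its parent buffer; calling
``second.tobytes()`` is what produces a copy.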

BufferSegment
^^^^^^^^^^^^^

The ``BufferSegment`` type represents a segment within a ``BufferWithSegments``.
It is essentially a reference to N bytes within a ``BufferWithSegments``.

``len()`` returns the length of the segment in bytes.

``.offset`` contains the byte offset of this segment within its parent
``BufferWithSegments`` instance.

The object conforms to the buffer protocol. ``.tobytes()`` can be called to
obtain a ``bytes`` instance with a copy of the backing bytes.

BufferSegments
^^^^^^^^^^^^^^

This type represents an array of ``(offset, length)`` integers defining segments
within a ``BufferWithSegments``.

The array members are 64-bit unsigned integers using host/native byte order.

Instances conform to the buffer protocol.

BufferWithSegmentsCollection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``BufferWithSegmentsCollection`` type represents a virtual spanning view
of multiple ``BufferWithSegments`` instances.

Instances are constructed from 1 or more ``BufferWithSegments`` instances. The
resulting object behaves like an ordered sequence whose members are the
segments within each ``BufferWithSegments``.

``len()`` returns the number of segments within all ``BufferWithSegments``
instances.

``o[index]`` and ``__getitem__(index)`` return the ``BufferSegment`` at
that offset as if all ``BufferWithSegments`` instances were a single
entity.

If the object is composed of 2 ``BufferWithSegments`` instances with the
first having 2 segments and the second having 3 segments, then ``b[0]``
and ``b[1]`` access segments in the first object and ``b[2]``, ``b[3]``,
and ``b[4]`` access segments from the second.
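
The index arithmetic in that example can be sketched as follows. This is an
assumption about one reasonable way to do the mapping, not a description of
how this module implements it; ``locate`` is a hypothetical helper that takes
the number of segments in each underlying buffer.

```python
import bisect
import itertools


def locate(segment_counts, index):
    """Map a collection-wide index to (buffer_index, local_index)."""
    total = sum(segment_counts)
    if not 0 <= index < total:
        raise IndexError("segment index out of range")
    # ends[i] is the number of segments held by buffers 0..i inclusive.
    ends = list(itertools.accumulate(segment_counts))
    # The first buffer whose cumulative end exceeds index contains it.
    buffer_index = bisect.bisect_right(ends, index)
    start = ends[buffer_index - 1] if buffer_index else 0
    return buffer_index, index - start
```

For the 2-segment plus 3-segment case above, ``locate([2, 3], 1)`` resolves
to the first buffer and ``locate([2, 3], 2)`` to the first segment of the
second buffer.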

Choosing an API
===============

There are multiple APIs for performing compression and decompression. This is
because different applications have different needs and the library wants to
facilitate optimal use in as many use cases as possible.

At a high level, APIs are divided into *one-shot* and *streaming*: either you
operate on all data at once or you operate on it piecemeal.

The *one-shot* APIs are useful for small data, where the input or output
size is known. (The size can come from a buffer length, a file size, or the
zstd frame header.) A limitation of the *one-shot* APIs is that input and
output must fit in memory simultaneously. For, say, a 4 GB input, this is
often not feasible.

The *one-shot* APIs also perform all work as a single operation. So, if you
feed them large input, it could take a long time for the function to return.

The streaming APIs do not have the limitations of the *one-shot* APIs. But the
price you pay for this flexibility is that they are more complex than a
single function call.

The streaming APIs put the caller in control of compression and decompression
behavior by allowing them to directly control either the input or output side
of the operation.

With the *streaming input*, *compressor*, and *decompressor* APIs, the caller
has full control over the input to the compression or decompression stream.
They can directly choose when new data is operated on.
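
As a rough sketch of what caller-controlled input looks like, the loop below
reads a source in chunks and feeds each chunk to a compressor object, so
memory use is bounded by the chunk size rather than the input size. It
assumes ``cobj`` exposes ``compress(data)`` and ``flush()`` in the style of
``zlib.compressobj()``; this module's ``ZstdCompressor.compressobj()`` is
understood to offer that shape. The helper name ``compress_stream`` and the
default chunk size are hypothetical; ``COMPRESSION_RECOMMENDED_INPUT_SIZE``
is the natural value to pass.

```python
def compress_stream(src, dst, cobj, chunk_size=131072):
    """Feed src to cobj one chunk at a time, writing output to dst.

    Returns (bytes_read, bytes_written).
    """
    bytes_read = bytes_written = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break  # end of input
        bytes_read += len(chunk)
        out = cobj.compress(chunk)  # may be empty while data is buffered
        dst.write(out)
        bytes_written += len(out)
    tail = cobj.flush()  # emit whatever the compressor still holds
    dst.write(tail)
    bytes_written += len(tail)
    return bytes_read, bytes_written
```

Compare this with a *one-shot* call, which is a single function invocation
but needs the whole input and the whole output in memory at once.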

With the *streaming output* APIs, the caller has full control over the output
of the compression or decompression stream. They can choose when to receive
new data.

When using the *streaming* APIs that operate on file-like or stream objects,
it is important to consider what happens in that object when I/O is requested.
There is potential for long pauses as data is read or written from the
underlying stream (say from interacting with a filesystem or network). This
could add considerable overhead.

Thread Safety
=============

``ZstdCompressor`` and ``ZstdDecompressor`` instances have no guarantees
about thread safety. Do not operate on the same ``ZstdCompressor`` or
``ZstdDecompressor`` instance simultaneously from different threads. It is
fine to have different threads call into a single instance, just not at the
same time.
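
One way to respect this rule is to give each thread its own instance. The
sketch below uses ``threading.local`` for that; ``get_thread_context`` is a
hypothetical helper, and ``factory`` is assumed to be a zero-argument
callable such as ``zstandard.ZstdCompressor``.

```python
import threading

_local = threading.local()


def get_thread_context(factory):
    """Return the calling thread's private instance made by factory."""
    cache = getattr(_local, "contexts", None)
    if cache is None:
        # First call on this thread: create its private cache.
        cache = _local.contexts = {}
    if factory not in cache:
        cache[factory] = factory()
    return cache[factory]
```

Each thread still reuses its own instance across calls, so the benefit of
context reuse is kept. Guarding a single shared instance with a
``threading.Lock`` is an alternative when only one context is wanted.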

Some operations, e.g. streaming operations, require multiple function calls
to complete. A single ``ZstdCompressor`` or ``ZstdDecompressor`` cannot be used
for simultaneously active operations. e.g. you must not start a streaming
operation when another streaming operation is already active.

The C extension releases the GIL during non-trivial calls into the zstd C
API. Non-trivial calls are notably compression and decompression. Trivial
calls are things like parsing frame parameters. Where the GIL is released
is considered an implementation detail and can change in any release.

APIs that accept bytes-like objects don't enforce that the underlying object
is read-only. However, it is assumed that the passed object is read-only for
the duration of the function call. It is possible to pass a mutable object
(like a ``bytearray``) to e.g. ``ZstdCompressor.compress()``, have the GIL
released, and mutate the object from another thread. Such a race condition
is a bug in the consumer of python-zstandard. Most Python data types are
immutable, so unless you are doing something fancy, you don't need to
worry about this.
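
If a mutable buffer may be modified by another thread, one defensive option
is to hand an immutable copy to the library instead. A minimal sketch, where
``snapshot`` is a hypothetical helper and the extra copy is the price paid:

```python
def snapshot(data):
    """Return an immutable bytes copy of any bytes-like object."""
    return bytes(data)


buf = bytearray(b"payload")
frozen = snapshot(buf)

# A later mutation of the original no longer affects the copy.
buf[0:1] = b"X"
```

Passing ``frozen`` rather than ``buf`` to a call such as
``ZstdCompressor.compress()`` keeps the input stable for the duration of the
call, at the cost of one additional allocation.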

Note on Zstandard's *Experimental* API
======================================

Many of the Zstandard APIs used by this module are marked as *experimental*
within the Zstandard project.

It is unclear how Zstandard's C API will evolve over time, especially with
regard to this *experimental* functionality. We will try to maintain
backwards compatibility at the Python API level. However, we cannot
guarantee this for things not under our control.

Since a copy of the Zstandard source code is distributed with this
module and since we compile against it, the behavior of a specific
version of this module should be constant for all of time. So if you
pin the version of this module used in your projects (which is a Python
best practice), you should be shielded from unwanted future changes.

Donate
======

A lot of time has been invested into this project by the author.

If you find this project useful and would like to thank the author for
their work, consider donating some money. Any amount is appreciated.

.. image:: https://www.paypalobjects.com/en_US/i/btn/btn_donate_LG.gif
    :target: https://www.paypal.com/cgi-bin/webscr?cmd=_donations&business=gregory%2eszorc%40gmail%2ecom&lc=US&item_name=python%2dzstandard&currency_code=USD&bn=PP%2dDonationsBF%3abtn_donate_LG%2egif%3aNonHosted
    :alt: Donate via PayPal

.. |ci-status| image:: https://dev.azure.com/gregoryszorc/python-zstandard/_apis/build/status/indygreg.python-zstandard?branchName=master
    :target: https://dev.azure.com/gregoryszorc/python-zstandard/_apis/build/status/indygreg.python-zstandard?branchName=master