ml.rl.readers package

Submodules

ml.rl.readers.base module

class ml.rl.readers.base.ReaderBase(batch_size=None, drop_small=True, num_shards=None)

Bases: object

do_get_shard(shard_id: int)

Subclasses should implement this if the reader is shardable

get_shard(shard_id: int)

Returns a shard of this reader

class ml.rl.readers.base.ReaderIter

Bases: object

abstract read_batch() → Optional[collections.OrderedDict]

Read a batch of data. The return value should be an OrderedDict. Returns None when there is no more data.
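A minimal sketch of how the two base classes fit together. ListReader and ListReaderIter below are hypothetical, not part of the package; they only illustrate the contract that read_batch() returns an OrderedDict per batch and None when the data is exhausted. (Storing batch_size locally in the subclass is an assumption made to avoid relying on ReaderBase internals.)

    from collections import OrderedDict
    from typing import Optional

    from ml.rl.readers.base import ReaderBase, ReaderIter


    class ListReaderIter(ReaderIter):
        """Hypothetical iterator over a ListReader (illustration only)."""

        def __init__(self, reader):
            super().__init__()
            self.reader = reader
            self.offset = 0

        def read_batch(self) -> Optional[OrderedDict]:
            values = self.reader.values
            if self.offset >= len(values):
                return None  # no more data
            end = self.offset + self.reader.batch_size
            batch = values[self.offset:end]
            self.offset = end
            return OrderedDict([("value", batch)])


    class ListReader(ReaderBase):
        """Hypothetical in-memory reader (illustration only)."""

        def __init__(self, values, batch_size):
            super().__init__(batch_size=batch_size)
            self.values = values
            self.batch_size = batch_size  # kept locally; ReaderBase internals not assumed

        def __iter__(self):
            return ListReaderIter(self)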

ml.rl.readers.data_streamer module

class ml.rl.readers.data_streamer.DataStreamer(data_reader, num_workers=0, pin_memory=False, timeout=0, worker_init_fn=None)

Bases: object

Data streamer. Provides single- or multi-process iterators over the data_reader.

Parameters
  • data_reader (DataReader) – data_reader from which to stream the data.

  • num_workers (int, optional) – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)

  • pin_memory (bool, optional) – If True, the data streamer will copy tensors into CUDA pinned memory before returning them.

  • timeout (numeric, optional) – if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: 0)

  • worker_init_fn (callable, optional) – If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)
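A hedged usage sketch based on the parameters above. The path "training_data.json" is a placeholder, and both the choice of JSONDatasetReader as the data_reader and the assumption that DataStreamer is iterable like a PyTorch DataLoader are assumptions, not confirmed API guarantees.

    from ml.rl.readers.data_streamer import DataStreamer
    from ml.rl.readers.json_dataset_reader import JSONDatasetReader

    # "training_data.json" is a placeholder path; any ReaderBase subclass
    # is assumed to be acceptable as data_reader.
    reader = JSONDatasetReader("training_data.json", batch_size=128)

    # Two worker subprocesses; pin batches for faster host-to-GPU copies.
    streamer = DataStreamer(reader, num_workers=2, pin_memory=True)

    # Assumption: the streamer is iterable, yielding OrderedDict batches that
    # mirror the underlying reader's read_batch() contract.
    for batch in streamer:
        pass  # feed `batch` to the trainer here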

Note

By default, each worker will have its PyTorch seed set to base_seed + worker_id, where base_seed is a long generated by the main process using its RNG. However, seeds for other libraries may be duplicated upon initializing workers (e.g., NumPy), causing each worker to return identical random numbers. (See the FAQ section on DataStreamer workers and random seeds.) You may use torch.initial_seed() to access the PyTorch seed for each worker in worker_init_fn, and use it to set other seeds before data loading.
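For example, a worker_init_fn along these lines (a sketch, with NumPy chosen as the library whose seed needs decorrelating) reuses the per-worker PyTorch seed:

    import numpy as np
    import torch

    def seed_numpy_per_worker(worker_id):
        # torch.initial_seed() already differs per worker (base_seed + worker_id);
        # reuse it so NumPy draws are decorrelated across workers as well.
        np.random.seed(torch.initial_seed() % 2 ** 32)

    # streamer = DataStreamer(reader, num_workers=4,
    #                         worker_init_fn=seed_numpy_per_worker)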

Warning

If the spawn start method is used, worker_init_fn cannot be an unpicklable object, e.g., a lambda function.

class ml.rl.readers.data_streamer.WorkerDone(worker_id)

Bases: tuple

property worker_id

Alias for field number 0

ml.rl.readers.data_streamer.pin_memory(batch)

This is adapted from the PyTorch DataLoader implementation. The only difference is that it preserves the Mapping type, so an OrderedDict stays an OrderedDict.
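A small usage sketch, assuming a CUDA-capable machine (pinning tensors requires CUDA); the point is that the returned batch is still an OrderedDict with its tensors moved to pinned memory.

    from collections import OrderedDict

    import torch

    from ml.rl.readers.data_streamer import pin_memory

    batch = OrderedDict([("state", torch.zeros(8, 4)), ("reward", torch.zeros(8, 1))])
    pinned = pin_memory(batch)

    assert isinstance(pinned, OrderedDict)  # the Mapping type is preserved
    assert pinned["state"].is_pinned()      # tensors now live in pinned memory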

ml.rl.readers.json_dataset_reader module

class ml.rl.readers.json_dataset_reader.JSONDatasetReader(path, batch_size=None, preprocess_handler=None)

Bases: ml.rl.readers.base.ReaderBase

Create the reader for a JSON training dataset.

line_count()

read_all()

read_batch()

reset_iterator()

class ml.rl.readers.json_dataset_reader.JSONDatasetReaderIter(reader)

Bases: ml.rl.readers.base.ReaderIter

read_batch() → Optional[collections.OrderedDict]

Read a batch of data. The return value should be an OrderedDict. Returns None when there is no more data.
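A usage sketch built only from the signatures above. The dataset path is a placeholder, and the assumptions are that the file holds one example per line (suggested by line_count()) and that JSONDatasetReader.read_batch() follows the same Optional[OrderedDict] contract as the iterator's read_batch().

    from ml.rl.readers.json_dataset_reader import JSONDatasetReader

    # "training_data.json" is a placeholder path.
    reader = JSONDatasetReader("training_data.json", batch_size=64)

    print(reader.line_count())      # number of examples (lines) in the file

    # Assumption: reader.read_batch() returns an OrderedDict per batch
    # and None when the data is exhausted.
    batch = reader.read_batch()
    while batch is not None:
        # ... consume the batch ...
        batch = reader.read_batch()

    reader.reset_iterator()         # rewind to read the dataset again
    everything = reader.read_all()  # or materialize the whole dataset at once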

ml.rl.readers.nparray_reader module

class ml.rl.readers.nparray_reader.NpArrayReader(data, size=None, **kwargs)

Bases: ml.rl.readers.base.ReaderBase

Basic reader that takes np.ndarrays of a whole dataset and splits them into chunks of batch_size.

do_get_shard(shard_id: int)

Subclasses should implement this if the reader is shardable

class ml.rl.readers.nparray_reader.NpArrayReaderIter(reader)

Bases: ml.rl.readers.base.ReaderIter

read_batch() → Optional[collections.OrderedDict]

Read a batch of data. The return value should be an OrderedDict. Returns None when there is no more data.
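A sketch of using NpArrayReader with its iterator. The layout of data (a mapping of equal-length arrays), the forwarding of batch_size and num_shards through **kwargs to ReaderBase, and the assumption that get_shard() returns another reader usable by NpArrayReaderIter are all assumptions made for illustration.

    from collections import OrderedDict

    import numpy as np

    from ml.rl.readers.nparray_reader import NpArrayReader, NpArrayReaderIter

    # Assumption: `data` is a mapping of equal-length ndarrays over the full dataset.
    data = OrderedDict(
        [("state", np.random.rand(1000, 4)), ("action", np.random.rand(1000, 2))]
    )

    # Assumption: batch_size and num_shards are forwarded to ReaderBase via **kwargs.
    reader = NpArrayReader(data, batch_size=32, num_shards=4)

    # Because do_get_shard is implemented, the reader is shardable: each worker
    # process can request its own slice of the data.
    shard = reader.get_shard(0)

    it = NpArrayReaderIter(shard)
    batch = it.read_batch()  # OrderedDict of array chunks, or None when exhausted
    while batch is not None:
        batch = it.read_batch()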

Module contents