MapReduce map() method

int MapReduce::map(int nmap, void (*mymap)(int, KeyValue *, void *), void *ptr)
int MapReduce::map(int nmap, void (*mymap)(int, KeyValue *, void *), void *ptr, int addflag) 
int MapReduce::map(char *file, void (*mymap)(int, char *, KeyValue *, void *), void *ptr)
int MapReduce::map(char *file, void (*mymap)(int, char *, KeyValue *, void *), void *ptr, int addflag) 
int MapReduce::map(int nmap, int nfiles, char **files, char sepchar, int delta, void (*mymap)(int, char *, int, KeyValue *, void *), void *ptr)
int MapReduce::map(int nmap, int nfiles, char **files, char sepchar, int delta, void (*mymap)(int, char *, int, KeyValue *, void *), void *ptr, int addflag) 
int MapReduce::map(int nmap, int nfiles, char **files, char *sepstr, int delta, void (*mymap)(int, char *, int, KeyValue *, void *), void *ptr)
int MapReduce::map(int nmap, int nfiles, char **files, char *sepstr, int delta, void (*mymap)(int, char *, int, KeyValue *, void *), void *ptr, int addflag) 
int MapReduce::map(MapReduce *mr2, void (*mymap)(uint64_t, char *, int, char *, int, KeyValue *, void *), void *ptr)
int MapReduce::map(MapReduce *mr2, void (*mymap)(uint64_t, char *, int, char *, int, KeyValue *, void *), void *ptr, int addflag) 

This calls the map() method of a MapReduce object. A function pointer to a mapping function you write is specified as an argument. This method either creates a new KeyValue object to store all the key/value pairs generated by your mymap function, or adds them to an existing KeyValue object. The method returns the total number of key/value pairs in the KeyValue object.

For the first set of variants (with and without addflag) you specify a total number of map tasks nmap to perform across all processors. The index of a map task is passed back to your mymap() function.

For the second set of variants you specify a master file that contains a list of filenames. A filename is passed back to your mymap() function. The master file should list one file per line. Blank lines are not allowed. Leading and trailing whitespace around the filename is OK.
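Because blank lines are not allowed, it can be useful to check a master file before handing it to map(). The following is a minimal sketch of such a pre-check, not the library's own parser: read_master() is a hypothetical helper name, and the set of characters it treats as whitespace is an assumption.

```cpp
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of MapReduce-MPI): read a master file that
// lists one filename per line, trimming leading and trailing whitespace.
// Throws if a blank line is found, since blank lines are not allowed.
std::vector<std::string> read_master(std::istream &in) {
  const char *ws = " \t\r";  // assumed whitespace set
  std::vector<std::string> files;
  std::string line;
  while (std::getline(in, line)) {
    std::size_t first = line.find_first_not_of(ws);
    if (first == std::string::npos)
      throw std::runtime_error("blank line in master file");
    std::size_t last = line.find_last_not_of(ws);
    files.push_back(line.substr(first, last - first + 1));
  }
  return files;
}
```

The number of names returned corresponds to nfiles, the upper bound on itask for this variant.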

For the third set of variants you specify an array of one or more file names and a separation character (sepchar). For the fourth set of variants, you specify an array of one or more file names and a separation string (sepstr). The file(s) are split into nmap chunks with roughly equal numbers of bytes in each chunk. One chunk from one file is read and passed back to your mymap() function, so your code does not read the file. See details below about the splitting methodology and the delta input parameter.

For the fifth set of variants, you specify an existing MapReduce object mr2 with key/value pairs, which can either be this MapReduce object or another one. The key/value pairs from mr2 are passed back to your mymap() function, one key/value at a time, allowing you to generate new key/value pairs from an existing set.

You can give any of the map() methods a pointer (void *ptr) which will be returned to your mymap() function. See the Technical Details section for why this can be useful. Just specify a NULL if you don't need this.

If the last argument addflag is omitted or is specified as 0, then map() will create a new KeyValue object, deleting any existing KeyValue object. If addflag is non-zero, then key/value pairs generated by your mymap() function are added to an existing KeyValue object, which is created if needed.
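The addflag rule can be illustrated with a toy, serial model. This is only a sketch under stated assumptions: the KeyValue struct is a stand-in so the code compiles without the library, and ToyMapReduce and emit_one() are hypothetical names that model nothing beyond the replace-or-append behavior described above.

```cpp
#include <string>
#include <utility>
#include <vector>

// Stand-in for the library's KeyValue class (assumption for this sketch).
struct KeyValue {
  std::vector<std::pair<std::string, std::string> > pairs;
  void add(char *key, int keybytes, char *value, int valuebytes) {
    pairs.push_back(std::make_pair(
        key ? std::string(key, keybytes) : std::string(),
        value ? std::string(value, valuebytes) : std::string()));
  }
};

// Toy serial model of the addflag rule; not the library's implementation.
class ToyMapReduce {
 public:
  int map(int nmap, void (*mymap)(int, KeyValue *, void *), void *ptr,
          int addflag = 0) {
    if (!addflag) kv_.pairs.clear();  // addflag == 0: discard existing pairs
    for (int itask = 0; itask < nmap; ++itask) mymap(itask, &kv_, ptr);
    return static_cast<int>(kv_.pairs.size());  // total pairs now stored
  }

 private:
  KeyValue kv_;
};

// Callback that emits exactly one pair per task.
void emit_one(int itask, KeyValue *kv, void *) {
  char key[] = "task";
  int value = itask;
  kv->add(key, static_cast<int>(sizeof(key)),
          reinterpret_cast<char *>(&value), static_cast<int>(sizeof(int)));
}
```

In this model, calling map(3, emit_one, nullptr) and then map(2, emit_one, nullptr, 1) leaves 5 pairs, while a further call with addflag 0 starts over from an empty set.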

If the fifth map() variant is called using the MapReduce object itself as an argument, and if addflag is 0, then the existing KeyValue object is effectively replaced by the newly generated key/value pairs. If addflag is non-zero, then the newly generated key/value pairs are added to the existing KeyValue object.

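The user function, called mymap() in the descriptions here, has one of four interfaces depending on which variant of the map() method is invoked. The third and fourth sets of map() variants (sepchar and sepstr) share the third interface, and the fifth set (an existing MapReduce object) uses the fourth interface: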

void mymap(int itask, KeyValue *kv, void *ptr)
void mymap(int itask, char *file, KeyValue *kv, void *ptr)
void mymap(int itask, char *str, int size, KeyValue *kv, void *ptr)
void mymap(uint64_t itask, char *key, int keybytes, char *value, int valuebytes, KeyValue *kv, void *ptr) 

In all cases, the final 2 arguments passed to your function are a pointer to a KeyValue object (kv) stored internally by the MapReduce object, and the original pointer you specified as an argument to the map() method, as void *ptr.

In the first case, itask is passed to your function with a value 0 <= itask < nmap, where nmap was specified in the map() call. For example, you could use itask to select a file from a list stored by your application. Your mymap() function could open and read the file or perform some other operation.
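A callback for the first interface might look like the sketch below, which uses itask to select a file name from a list passed through void *ptr. The KeyValue struct is a stand-in so the code compiles without the library, and FileList and run_tasks() are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the library's KeyValue class (assumption for this sketch).
struct KeyValue {
  std::vector<std::pair<std::string, std::string> > pairs;
  void add(char *key, int keybytes, char *value, int valuebytes) {
    pairs.push_back(std::make_pair(
        key ? std::string(key, keybytes) : std::string(),
        value ? std::string(value, valuebytes) : std::string()));
  }
};

// Hypothetical application data handed to map() through void *ptr.
struct FileList {
  std::vector<std::string> names;
};

// Callback with the first interface: invoked once per task index.
void mymap(int itask, KeyValue *kv, void *ptr) {
  FileList *list = static_cast<FileList *>(ptr);
  assert(itask >= 0 && itask < static_cast<int>(list->names.size()));
  std::string name = list->names[itask];  // pick a file by task index
  int id = itask;
  // Key: file name including its trailing '\0'. Value: raw bytes of the index.
  kv->add(&name[0], static_cast<int>(name.size()) + 1,
          reinterpret_cast<char *>(&id), static_cast<int>(sizeof(int)));
}

// Serial stand-in for map(nmap, mymap, ptr): one callback call per task.
int run_tasks(int nmap, KeyValue *kv, void *ptr) {
  for (int itask = 0; itask < nmap; ++itask) mymap(itask, kv, ptr);
  return static_cast<int>(kv->pairs.size());
}
```

A real callback would typically open and read the selected file rather than just emit its name.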

In the second case, itask will have a value 0 <= itask < nfiles, where nfiles is the number of filenames in the master file you specified. Your function is also passed a single filename, which it will presumably open and read.

In the third case (used by the map() variants that take sepchar or sepstr), itask will have a value 0 <= itask < nmap, where nmap was specified in the map() call and is the number of file chunks generated. Your function is also passed a string of bytes (str) of length size from one of the files. The size includes a trailing '\0' that is appended to the string.
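Since size counts the appended '\0', the text itself occupies size - 1 bytes. The sketch below shows a callback for the third interface that emits one key per whitespace-delimited word with an empty value. The KeyValue struct is a stand-in so the code compiles without the library, and the word-splitting rule is an illustrative assumption.

```cpp
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the library's KeyValue class (assumption for this sketch).
struct KeyValue {
  std::vector<std::pair<std::string, std::string> > pairs;
  void add(char *key, int keybytes, char *value, int valuebytes) {
    pairs.push_back(std::make_pair(
        key ? std::string(key, keybytes) : std::string(),
        value ? std::string(value, valuebytes) : std::string()));
  }
};

// Callback with the third interface: str holds one chunk of a file.
void mymap(int itask, char *str, int size, KeyValue *kv, void *ptr) {
  (void)itask;
  (void)ptr;
  int nbytes = size - 1;  // size includes the trailing '\0'
  int i = 0;
  while (i < nbytes) {
    // Skip whitespace, then collect one word.
    while (i < nbytes && std::isspace(static_cast<unsigned char>(str[i]))) ++i;
    int begin = i;
    while (i < nbytes && !std::isspace(static_cast<unsigned char>(str[i]))) ++i;
    if (i > begin) {
      std::string word(str + begin, i - begin);
      // Key: the word plus its '\0'. Value: empty (NULL pointer, 0 bytes).
      kv->add(&word[0], static_cast<int>(word.size()) + 1, NULL, 0);
    }
  }
}
```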

For map() methods that take files and a separation criterion as arguments, you must specify nmap >= nfiles, so that there are one or more map tasks per file. For files that are split into multiple chunks, the split is done at occurrences of the separation character or string. You specify a delta, the number of extra bytes to read with each chunk, which must be large enough to guarantee that the splitting character or string is found within that many bytes. For example, if the files are lines of text, you could choose a newline character '\n' as the sepchar, and a delta of 80 (if the longest line in your files is 80 characters). If the files are snapshots of simulation data where each snapshot is 1000 lines (no more than 80 characters per line), you could choose the first line of each snapshot (e.g. "Snapshot") as the sepstr, and a delta of 80000. Note that if the separation character or string is not found within delta bytes, an error will be generated. Also note that there is no harm in choosing a large delta so long as it is not larger than the chunk size for a particular file.
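The role of delta can be sketched with an in-memory buffer. The code below is one plausible reading of the splitting rule, not the library's implementation: split_chunks() is a hypothetical helper that places nominal boundaries at equal byte offsets and moves each one forward to just past the next sepchar, failing if none is found within delta bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not the library's code): split data into nchunks
// pieces of roughly equal size, each ending just after a sepchar.
std::vector<std::string> split_chunks(const std::string &data, int nchunks,
                                      char sepchar, std::size_t delta) {
  assert(nchunks > 0);
  std::vector<std::string> chunks;
  std::size_t start = 0;
  for (int i = 1; i <= nchunks; ++i) {
    std::size_t end = data.size();  // the last chunk runs to the end
    if (i < nchunks) {
      // Nominal boundary at an equal share of the bytes.
      std::size_t nominal = std::max(start, data.size() * i / nchunks);
      // Search at most delta bytes past the nominal boundary.
      std::size_t limit = std::min(data.size(), nominal + delta);
      std::size_t pos = nominal;
      while (pos < limit && data[pos] != sepchar) ++pos;
      if (pos == limit)
        throw std::runtime_error("separator not found within delta bytes");
      end = pos + 1;  // chunk ends with the sepchar
    }
    chunks.push_back(data.substr(start, end - start));
    start = end;
  }
  return chunks;
}
```

For the 16-byte buffer "aa\nbbbb\ncc\ndddd\n" split into 2 chunks with delta 8, the nominal boundary at byte 8 moves forward to just past the newline at byte 10. With delta 1 the newline is out of reach and the sketch throws, mirroring the error described above.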

If the separation criterion is a character (sepchar), the chunk of bytes passed to your mymap() function will start with the character after a sepchar, and will end with a sepchar (followed by a '\0'). If the separation criterion is a string (sepstr), the chunk of bytes passed to your mymap() function will start with sepstr, and will end with the character immediately preceding a sepstr (followed by a '\0'). Note that this means your mymap() function will be passed different byte strings if you specify sepchar = 'A' vs sepstr = "A".
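The difference between the two conventions can be seen by cutting the same buffer both ways. This is an illustrative sketch only: split_at_char() and split_at_str() are hypothetical helpers that cut a string once, at the first separator at or beyond a given offset.

```cpp
#include <cstddef>
#include <string>
#include <utility>

typedef std::pair<std::string, std::string> Pieces;

// sepchar convention: the first piece ends with the separator character.
Pieces split_at_char(const std::string &data, std::size_t from, char sepchar) {
  std::size_t pos = data.find(sepchar, from);
  if (pos == std::string::npos) return std::make_pair(data, std::string());
  return std::make_pair(data.substr(0, pos + 1), data.substr(pos + 1));
}

// sepstr convention: the second piece starts with the separator string.
Pieces split_at_str(const std::string &data, std::size_t from,
                    const std::string &sepstr) {
  std::size_t pos = data.find(sepstr, from);
  if (pos == std::string::npos) return std::make_pair(data, std::string());
  return std::make_pair(data.substr(0, pos), data.substr(pos));
}
```

For the buffer "xxAyyAzz" cut at the first 'A', the sepchar convention gives "xxA" and "yyAzz", while the sepstr convention gives "xx" and "AyyAzz".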

In the fourth case (used by the map() variants that take an existing MapReduce object), itask will have a value 0 <= itask < nkey, where nkey is an unsigned 64-bit int and is the number of key/value pairs in the specified MapReduce object. Key and value are the byte strings for a single key/value pair and are of length keybytes and valuebytes respectively.
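A callback for the fourth interface can derive new pairs from existing ones. The sketch below simply swaps key and value, so that each old value becomes a key. The KeyValue struct is a stand-in so the code compiles without the library.

```cpp
#include <stdint.h>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the library's KeyValue class (assumption for this sketch).
struct KeyValue {
  std::vector<std::pair<std::string, std::string> > pairs;
  void add(char *key, int keybytes, char *value, int valuebytes) {
    pairs.push_back(std::make_pair(
        key ? std::string(key, keybytes) : std::string(),
        value ? std::string(value, valuebytes) : std::string()));
  }
};

// Callback with the fourth interface: called once per existing pair.
void mymap(uint64_t itask, char *key, int keybytes, char *value,
           int valuebytes, KeyValue *kv, void *ptr) {
  (void)itask;
  (void)ptr;
  // Emit the pair with key and value exchanged.
  kv->add(value, valuebytes, key, keybytes);
}
```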

The MapReduce library assigns map tasks to processors. Options for how it does this can be controlled by MapReduce settings. In essence, about nmap/P tasks are assigned to each processor, where P is the number of processors in the MPI communicator with which you instantiated the MapReduce object.
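Two simple ways of giving each processor about nmap/P tasks are sketched below: contiguous blocks and a strided pattern. Both are assumptions chosen for illustration. The function names are hypothetical, and the library's actual assignment depends on its settings.

```cpp
#include <cassert>
#include <vector>

// Contiguous blocks: processor me gets a consecutive range of task indices.
std::vector<int> chunk_tasks(int nmap, int nprocs, int me) {
  assert(nprocs > 0 && me >= 0 && me < nprocs);
  long long lo = static_cast<long long>(me) * nmap / nprocs;
  long long hi = static_cast<long long>(me + 1) * nmap / nprocs;
  std::vector<int> tasks;
  for (long long t = lo; t < hi; ++t) tasks.push_back(static_cast<int>(t));
  return tasks;
}

// Strided: processor me gets tasks me, me + P, me + 2P, ...
std::vector<int> stride_tasks(int nmap, int nprocs, int me) {
  assert(nprocs > 0 && me >= 0 && me < nprocs);
  std::vector<int> tasks;
  for (int t = me; t < nmap; t += nprocs) tasks.push_back(t);
  return tasks;
}
```

With nmap = 10 and P = 4, either scheme gives every processor 2 or 3 tasks.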

Typically, your mymap() function will produce key/value pairs which it registers with the MapReduce object by calling the add() method of the KeyValue object. The syntax for registration is described on the doc page of the KeyValue add() method.

See the Settings and Technical Details sections for details on the byte-alignment of keys and values you register with the KeyValue add() methods or that are passed to your mymap() function.

Aside from the assignment of tasks to processors, this method is really an on-processor operation, requiring no communication. When run in parallel, each processor generates key/value pairs and stores them, independently of other processors.


Related methods: KeyValue add(), reduce()