Tokenize a String in C++


In this article, we’ll take a look at how we can tokenize a C++ String.

Many other languages offer a simple built-in solution, such as string.split(). Unfortunately, the C++ standard library’s std::string has no such method, so we’ll show you a few different approaches.

Approach 1: Convert to a C string and use strtok()

To anyone familiar with C, the most obvious approach would be to convert the C++ string into a character array (“C string”), and then use strtok() on the C string.

Since strtok() is the native C tokenizer, this is one possible way.

#include <cstdio>   // For printf()
#include <cstring>  // For C-style strtok()
#include <iostream>
#include <string>   // C++ Strings
#include <vector>

using namespace std;

void c_style_tokenizer(string inp, const char* delim) {
    // Tokenizes the input string and prints the output
    // It does not return anything, since this is just for
    // illustration
    const char* c_string = inp.c_str();

    // strtok() overwrites the delimiters it finds, and the characters
    // behind c_str() must not be modified, so work on a writable copy
    // (size() + 1 also copies the terminating '\0')
    vector<char> buffer(c_string, c_string + inp.size() + 1);

    // Tokenize the C string using the delimiter
    char* token = strtok(buffer.data(), delim);

    while (token) {
        printf("Token: %s\n", token);
        // Get next token
        token = strtok(NULL, delim);
    }
}

int main() {
    // strtok() takes its delimiters as a C string rather than a
    // std::string, so we pass a string literal
    string input = "Hello from JournalDev";
    cout << "Input String: " << input << endl;
    c_style_tokenizer(input, " ");
    return 0;
}

Output

Input String: Hello from JournalDev
Token: Hello
Token: from
Token: JournalDev

As you can see, after converting to a C string using string.c_str() and processing it with strtok(), we get our tokenized output!

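However, this method has some pitfalls. strtok() requires a \0-terminated input and will read past the end of the buffer if the terminator is missing. It also modifies the buffer it scans, overwriting each delimiter it finds with \0, so it needs a writable copy of the data; the characters returned by string.c_str() must not be modified in place.

Also, if your program uses multiple threads, or tokenizes two strings in an interleaved fashion, this approach may fail, since strtok() keeps track of the current position in hidden static state shared by all callers.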

Because of these caveats, we’ll present some more suitable approaches.

Approach 2: Use Regex Token Iterators (Recommended)

Another approach is to use sregex_token_iterator, declared in the <regex> header. This is the recommended method for modern C++, as it relies only on the standard library and works with the usual iterator and container idioms.

If we want to split on whitespace, we first construct a regex object using the regular expression “\s+”, which matches one or more whitespace characters in succession:

regex reg("\\s+");

Inside a C++ string literal, the backslash in “\s+” must itself be escaped, so the pattern is written as “\\s+”.

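Now that we have our regex pattern, we can create an sregex_token_iterator over the input and construct a vector of strings from it. Passing -1 as the last argument tells the iterator to return the text between the matches, rather than the matches themselves.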

// Courtesy: https://stackoverflow.com/a/27468529

#include <iostream>
#include <regex>
#include <string>
#include <vector>

using namespace std;

int main()
{
    string str("Hello from JournalDev");
    
    // Regex for tokenizing whitespaces
    regex reg("\\s+");

    // Get an iterator over the pieces between regex matches
    // (-1 selects the text that did NOT match, i.e. the tokens)
    sregex_token_iterator iter(str.begin(), str.end(), reg, -1);
    // Keep a dummy end iterator - Needed to construct a vector
    // using (start, end) iterators.
    sregex_token_iterator end;

    vector<string> vec(iter, end);

    for (const auto& a : vec)
    {
        cout << a << endl;
    }
}

This now gives you the tokenized string as a vector. It’s convenient too: if you don’t need to store the tokens, you can loop over the iterator directly and get each token on demand!

Output

Hello
from
JournalDev

While this is a robust method, take care when processing large strings: regular expressions add noticeable overhead and are not the most efficient way to split on a simple delimiter.

Approach 3: Use the Boost library

If your current requirements allow the use of external libraries like Boost, you may use this approach.

We can use the boost::algorithm::split function here, to tokenize our input string.

// Courtesy - https://stackoverflow.com/a/59552
#include <iostream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>

int main() {
    std::string s = "a,b, c ,,e,f,";
    std::vector<std::string> fields;
    boost::split(fields, s, boost::is_any_of(","));
    for (const auto& field : fields)
        std::cout << "\"" << field << "\"\n";
    return 0;
}

Output

"a"
"b"
" c "
""
"e"
"f"
""

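This indeed gives us our tokenized output, split at the commas. Note that whitespace inside the fields is kept, and that consecutive or trailing commas produce empty strings.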


Conclusion

In this article, we showed you how you could tokenize a string in C++, using different approaches.

