Arrays, Strings, and Algorithmic Basics
Manipulate collections of data and reason about elementary searching and sorting.
ASCII and Unicode Basics
ASCII and Unicode Basics — Bytes, Code Points, and Why Your Strings Lie to You
"Remember when a character was a single byte and life was simple? That was ASCII. Then Unicode showed up and everything got interesting."
You're coming into this after learning about arrays, indexing, and strings with null terminators in C. Good — you already know that a C string is an array of bytes terminated by '\0'. Now let's ask: what do those bytes actually mean? That's where ASCII and Unicode come in.
What is ASCII? (Short, powerful, vintage)
- ASCII = American Standard Code for Information Interchange.
- It maps 128 characters to numbers 0–127. Think: letters, digits, punctuation, and control characters (like newline, tab).
Micro explanation
- Character = human notion (e.g., 'A').
- Code point = numeric value assigned (e.g., 65 for 'A').
- Byte = 8 bits storing the code (ASCII fits in one byte).
In C, a char is essentially a small integer. So:
char c = 'A';
printf("%c %d\n", c, c); // prints: A 65
Yes: 'A' + 1 gives 'B' because these are just numbers under the hood. This is why for (int i = '0'; i <= '9'; i++) is a thing.
Why ASCII matters in CS50 strings and arrays
- When you index a char[], each element is one ASCII byte. strlen counts bytes until \0. For ASCII text, bytes == characters. Simple.
Unicode: The global upgrade (and the plot twist)
ASCII was great until people used languages other than English. Unicode solves that by giving a unique code point to pretty much every character you can imagine: letters, emojis, hieroglyphs, dingbats.
- Unicode code points are written like U+0041 (which is 'A').
- The code space spans 1,114,112 code points (U+0000 through U+10FFFF), though only a fraction of them are assigned so far.
Encodings: How we pack code points into bytes
Unicode is an abstract mapping of characters to numbers. Encodings are the practical rules for storing those numbers as bytes. The most important encoding to know is UTF-8.
- UTF-8: variable-length, 1–4 bytes per code point. Backwards-compatible with ASCII: ASCII characters are encoded as single bytes with the same values 0–127.
- UTF-16: 2 or 4 bytes per code point. Used internally by some systems (Windows, JavaScript historically used UCS-2/UTF-16).
- UTF-32: fixed 4 bytes per code point (simple but memory-heavy).
Important property: ASCII ⊂ UTF-8
If your text is plain ASCII, it is also valid UTF-8 and identical byte-for-byte. That’s why many old C programs “just worked” when you moved to UTF-8 files — until you met 'ô' or '😊'.
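One way to see this property concretely: a hypothetical helper (not from the original text) that checks whether a buffer is pure ASCII. If it returns true, the very same bytes are also valid UTF-8, unchanged.

```c
#include <stdbool.h>
#include <stddef.h>

// Returns true if every byte is in the ASCII range 0-127.
// Such a buffer is, byte for byte, also valid UTF-8.
bool is_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        if (buf[i] > 127)  // high bit set means not plain ASCII
        {
            return false;
        }
    }
    return true;
}
```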
Example: A tiny betrayal by bytes
Let's compare 'A' (ASCII) and '€' (Euro sign) in UTF-8.
- 'A' -> U+0041 -> 0x41 (1 byte)
- '€' -> U+20AC -> 0xE2 0x82 0xAC (3 bytes)
C code showing bytes (note: treat as unsigned to print correctly):
unsigned char s[] = "A€"; // in a UTF-8 source file
for (size_t i = 0; i < sizeof(s); i++)
printf("byte %zu = 0x%02x\n", i, s[i]);
// Output might be:
// byte 0 = 0x41
// byte 1 = 0xe2
// byte 2 = 0x82
// byte 3 = 0xac
// byte 4 = 0x00 <-- null terminator
Note: strlen(s) will return 4 (bytes before the \0), not 2 characters.
Practical consequences for algorithms and strings
- Indexing is by byte, not by human character. If you do s[1] on a UTF-8 string, you might land in the middle of a multi-byte code point and get nonsense.
- strlen(s) is O(n) in bytes. For algorithms, remember the difference between counting bytes and counting code points.
- Sorting/comparison: For pure ASCII text, lexicographic comparison of bytes works as expected. For Unicode-aware sorting (collation), rules get complex: 'Å' might sort near 'A' or 'Z' depending on locale.
- Memory/time tradeoffs: UTF-8 is space-efficient for ASCII-heavy text. UTF-32 makes indexing by code point O(1) but costs 4× space.
Algorithmic tip
If you need to process user-visible characters (grapheme clusters), use a library that understands Unicode. Trying to implement correct Unicode handling from scratch is like trying to fold a fitted sheet perfectly on the first try — theoretically possible, practically painful.
C and Unicode: What should a CS50 student do?
- For simple exercises (ASCII-only inputs), keep using char[] and strlen() — everything behaves as you've learned.
- When handling general user input (international text, emojis), be aware:
- Files and terminals commonly use UTF-8.
- Use unsigned char when inspecting raw bytes to avoid sign-extension issues.
- Prefer libraries: ICU, iconv, or platform-specific APIs for full Unicode support.
Small examples:
// Counting bytes vs code points is different.
char *s = "π"; // U+03C0, UTF-8 encoding: 0xCF 0x80
printf("bytes: %zu\n", strlen(s));
// To count actual code points you'd have to decode UTF-8 sequences.
Or decode manually (high-level idea):
- Read a byte.
- If top bit is 0 → 1-byte code point.
- If starts with 110 → 2 bytes total.
- If starts with 1110 → 3 bytes total.
- If starts with 11110 → 4 bytes total.
- Verify continuation bytes start with 10.
But remember: that counts code points, not user-perceived characters (grapheme clusters).
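The steps above can be sketched as a code-point counter. This is a simplified sketch: it checks lead and continuation byte patterns but does not reject overlong encodings or out-of-range code points, which a production decoder must also do.

```c
#include <stddef.h>

// Count code points in a NUL-terminated UTF-8 string.
// Returns (size_t)-1 if a lead or continuation byte is malformed.
size_t count_code_points(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;
    while (*p != '\0')
    {
        size_t len;
        if (*p < 0x80)                len = 1;  // 0xxxxxxx
        else if ((*p & 0xE0) == 0xC0) len = 2;  // 110xxxxx
        else if ((*p & 0xF0) == 0xE0) len = 3;  // 1110xxxx
        else if ((*p & 0xF8) == 0xF0) len = 4;  // 11110xxx
        else return (size_t)-1;                 // invalid lead byte

        p++;
        for (size_t i = 1; i < len; i++, p++)
        {
            if ((*p & 0xC0) != 0x80)            // continuation must be 10xxxxxx
            {
                return (size_t)-1;
            }
        }
        count++;
    }
    return count;
}
```

On the earlier examples: "A€" (bytes 0x41 0xE2 0x82 0xAC) counts as 2 code points, and "π" (0xCF 0x80) counts as 1, even though strlen reports 4 and 2 bytes respectively.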
Quick reference: Common pitfalls
- Using strlen to measure user-visible characters for multilingual text.
- Indexing char[] to jump between characters in UTF-8 (may break on multibyte chars).
- Assuming char always holds a full character — in the Unicode world, it often doesn't.
Key takeaways
- ASCII: simple, single-byte (0–127). Your early C work uses this model implicitly.
- Unicode: universal code points. UTF-8 is the dominant encoding — variable-length but ASCII-compatible.
- In C, a string is a sequence of bytes ending with \0. Bytes ≠ characters when using UTF-8.
- For correct multilingual text handling, use established libraries and be explicit about encodings.
"If ASCII is the cozy studio apartment of characters, Unicode is the entire city. UTF-8 is the subway map — efficient, but sometimes you need to transfer lines to really get where you're going."
Further reading suggestions: UTF-8 on Wikipedia, the Unicode Standard overview, and the ICU library docs. Writing a full UTF-8 decoder in C is also a great exercise — indexing, arrays, and control flow collide in a satisfying mess.