Arrays, Strings, and Algorithmic Basics
Manipulate collections of data and reason about elementary searching and sorting.
ASCII and Unicode Basics
ASCII and Unicode Basics — Bytes, Code Points, and Why Your Strings Lie to You
"Remember when a character was a single byte and life was simple? That was ASCII. Then Unicode showed up and everything got interesting."
You're coming into this after learning about arrays, indexing, and strings with null terminators in C. Good — you already know that a C string is an array of bytes terminated by '\0'. Now let's ask: what do those bytes actually mean? That's where ASCII and Unicode come in.
What is ASCII? (Short, powerful, vintage)
- ASCII = American Standard Code for Information Interchange.
- It maps 128 characters to numbers 0–127. Think: letters, digits, punctuation, and control characters (like newline, tab).
Micro explanation
- Character = human notion (e.g., 'A').
- Code point = numeric value assigned (e.g., 65 for 'A').
- Byte = 8 bits storing the code (ASCII fits in one byte).
In C, a char is essentially a small integer. So:
char c = 'A';
printf("%c %d\n", c, c); // prints: A 65
Yes: 'A' + 1 gives 'B' because these are just numbers under the hood. This is why for (int i = '0'; i <= '9'; i++) is a thing.
Why ASCII matters in CS50 strings and arrays
- When you index a char[], each element is one ASCII byte. strlen counts bytes until \0. For ASCII text, bytes == characters. Simple.
Unicode: The global upgrade (and the plot twist)
ASCII was great until people used languages other than English. Unicode solves that by giving a unique code point to pretty much every character you can imagine: letters, emojis, hieroglyphs, dingbats.
- Unicode code points are written like U+0041 (which is 'A').
- The code space spans 1,114,112 code points (U+0000 through U+10FFFF), though only a fraction of them are assigned so far.
Encodings: How we pack code points into bytes
Unicode is an abstract mapping of characters to numbers. Encodings are the practical rules for storing those numbers as bytes. The most important encoding to know is UTF-8.
- UTF-8: variable-length, 1–4 bytes per code point. Backwards-compatible with ASCII: ASCII characters are encoded as single bytes with the same values 0–127.
- UTF-16: 2 or 4 bytes per code point. Used internally by some systems (Windows, JavaScript historically used UCS-2/UTF-16).
- UTF-32: fixed 4 bytes per code point (simple but memory-heavy).
Important property: ASCII ⊂ UTF-8
If your text is plain ASCII, it is also valid UTF-8 and identical byte-for-byte. That’s why many old C programs “just worked” when you moved to UTF-8 files — until you met 'ô' or '😊'.
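One way to see this property concretely: a hypothetical helper (not from the original text) that checks whether a buffer is pure ASCII. If it returns true, the very same bytes are also valid UTF-8, unchanged.

```c
#include <stdbool.h>
#include <stddef.h>

// Returns true if every byte is in the ASCII range 0-127.
// Such a buffer is, byte for byte, also valid UTF-8.
bool is_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        if (buf[i] > 127)  // high bit set means not plain ASCII
        {
            return false;
        }
    }
    return true;
}
```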
Example: A tiny betrayal by bytes
Let's compare 'A' (ASCII) and '€' (Euro sign) in UTF-8.
- 'A' -> U+0041 -> 0x41 (1 byte)
- '€' -> U+20AC -> 0xE2 0x82 0xAC (3 bytes)
C code showing bytes (note: treat as unsigned to print correctly):
unsigned char s[] = "A€"; // in a UTF-8 source file
for (size_t i = 0; i < sizeof(s); i++)
printf("byte %zu = 0x%02x\n", i, s[i]);
// Output might be:
// byte 0 = 0x41
// byte 1 = 0xe2
// byte 2 = 0x82
// byte 3 = 0xac
// byte 4 = 0x00 <-- null terminator
Note: strlen(s) will return 4 (bytes before the \0), not 2 characters.
Practical consequences for algorithms and strings
- Indexing is by byte, not by human character. If you do s[1] on a UTF-8 string, you might land in the middle of a multi-byte code point and get nonsense.
- strlen(s) is O(n) in bytes. For algorithms, remember the difference between counting bytes and counting code points.
- Sorting/comparison: For pure ASCII text, lexicographic comparison of bytes works as expected. For Unicode-aware sorting (collation), rules get complex: 'Å' might sort near 'A' or 'Z' depending on locale.
- Memory/time tradeoffs: UTF-8 is space-efficient for ASCII-heavy text. UTF-32 makes indexing by code point O(1) but costs 4× space.
Algorithmic tip
If you need to process user-visible characters (grapheme clusters), use a library that understands Unicode. Trying to implement correct Unicode handling from scratch is like trying to fold a fitted sheet perfectly on the first try — theoretically possible, practically painful.
C and Unicode: What should a CS50 student do?
- For simple exercises (ASCII-only inputs), keep using char[] and strlen() — everything behaves as you've learned.
- When handling general user input (international text, emojis), be aware:
- Files and terminals commonly use UTF-8.
- Use unsigned char when inspecting raw bytes to avoid sign-extension issues.
- Prefer libraries: ICU, iconv, or platform-specific APIs for full Unicode support.
Small examples:
// Counting bytes vs code points is different.
char *s = "π"; // U+03C0, UTF-8 encoding: 0xCF 0x80
printf("bytes: %zu\n", strlen(s));
// To count actual code points you'd have to decode UTF-8 sequences.
Or decode manually (high-level idea):
- Read a byte.
- If top bit is 0 → 1-byte code point.
- If starts with 110 → 2 bytes total.
- If starts with 1110 → 3 bytes total.
- If starts with 11110 → 4 bytes total.
- Verify continuation bytes start with 10.
But remember: that counts code points, not user-perceived characters (grapheme clusters).
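The steps above can be sketched as a code-point counter. This is a simplified sketch: it checks lead and continuation byte patterns but does not reject overlong encodings or out-of-range code points, which a production decoder must also do.

```c
#include <stddef.h>

// Count code points in a NUL-terminated UTF-8 string.
// Returns (size_t)-1 if a lead or continuation byte is malformed.
size_t count_code_points(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;
    while (*p != '\0')
    {
        size_t len;
        if (*p < 0x80)                len = 1;  // 0xxxxxxx
        else if ((*p & 0xE0) == 0xC0) len = 2;  // 110xxxxx
        else if ((*p & 0xF0) == 0xE0) len = 3;  // 1110xxxx
        else if ((*p & 0xF8) == 0xF0) len = 4;  // 11110xxx
        else return (size_t)-1;                 // invalid lead byte

        p++;
        for (size_t i = 1; i < len; i++, p++)
        {
            if ((*p & 0xC0) != 0x80)            // continuation must be 10xxxxxx
            {
                return (size_t)-1;
            }
        }
        count++;
    }
    return count;
}
```

On the earlier examples: "A€" (bytes 0x41 0xE2 0x82 0xAC) counts as 2 code points, and "π" (0xCF 0x80) counts as 1, even though strlen reports 4 and 2 bytes respectively.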
Quick reference: Common pitfalls
- Using strlen to measure user-visible characters for multilingual text.
- Indexing char[] to jump between characters in UTF-8 (may break on multibyte chars).
- Assuming char always holds a full character — in the Unicode world, it often doesn't.
Key takeaways
- ASCII: simple, single-byte (0–127). Your early C work uses this model implicitly.
- Unicode: universal code points. UTF-8 is the dominant encoding — variable-length but ASCII-compatible.
- In C, a string is a sequence of bytes ending with \0. Bytes ≠ characters when using UTF-8.
- For correct multilingual text handling, use established libraries and be explicit about encodings.
"If ASCII is the cozy studio apartment of characters, Unicode is the entire city. UTF-8 is the subway map — efficient, but sometimes you need to transfer lines to really get where you're going."
Further reading suggestions: UTF-8 on Wikipedia, the Unicode Standard overview, and the ICU library docs. Writing a full UTF-8 decoder in C is also a great exercise — indexing, arrays, and control flow collide in a satisfying mess.