
Programmatically Detect Emoji in Text with Python

Serhii Hrekov
software engineer, creator, artist, programmer, projects founder

🔎 How to Programmatically Detect Emoji in Text with Python

Programmatically detecting and extracting emoji from text is a common task in data science and natural language processing (NLP). Unlike standard ASCII characters, emojis are complex Unicode characters or sequences that can span multiple code points, making simple string checks or basic regular expressions unreliable.
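
To make the "multiple code points" point concrete, here is a minimal sketch using only the standard library. The helper name `describe` is made up for this illustration; the emoji are written as escape sequences so the snippet does not depend on file encoding.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

thumbs_up = "\U0001F44D\U0001F3FD"        # thumbs up + medium skin tone modifier
astronaut = "\U0001F469\u200D\U0001F680"  # woman + zero-width joiner + rocket

# Each of these renders as ONE emoji, yet Python counts code points
print(len(thumbs_up))  # 2
print(len(astronaut))  # 3

for code_point, name in describe(astronaut):
    print(code_point, name)
```

Because `len()` and indexing operate on code points, any per-character check sees the pieces of a sequence rather than the emoji a reader sees.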

The most robust and recommended approach in Python is to use a specialized third-party library that maintains the latest list of Unicode emoji definitions.

1. The Recommended Method: The emoji Library

The emoji library is the de facto standard for working with emojis in Python. It includes functions not only for conversion but also for precise detection and analysis based on the latest Unicode standards.

Installation

pip install emoji

A. Checking if a String Contains Emoji

The emoji library can tell you whether a string contains any emoji. Older tutorials use emoji.get_emoji_regexp() for this, but that function was removed in version 2.0 of the library; emoji.emoji_count() is a straightforward replacement.

import emoji

text_with_emoji = "Python is fun! 🐍💻🔥"
text_without_emoji = "This text is clean."

# emoji_count() returns the number of emojis in the string;
# any value above zero means at least one emoji is present
contains_emoji = emoji.emoji_count(text_with_emoji) > 0
does_not_contain = emoji.emoji_count(text_without_emoji) > 0

print(f"'{text_with_emoji}' contains emoji: {contains_emoji}") # True
print(f"'{text_without_emoji}' contains emoji: {does_not_contain}") # False

B. Extracting All Emojis from Text

To retrieve every emoji found in the text, including those made up of multiple code points (such as skin tone variants or ZWJ sequences like family and profession emojis), use emoji.emoji_list():

import emoji

def extract_emojis(text):
    """Returns a list of match dictionaries, one per emoji found in the text."""
    return emoji.emoji_list(text)

sample_text = "I love this library! 👍🏽 The astronaut 👩‍🚀 is cool."
emojis_found = extract_emojis(sample_text)

print(emojis_found)
# Output (offsets are code point indexes, as in normal Python string indexing):
# [{'match_start': 21, 'match_end': 23, 'emoji': '👍🏽'},
#  {'match_start': 38, 'match_end': 41, 'emoji': '👩‍🚀'}]

# To get just the emoji characters:
emoji_chars = [match['emoji'] for match in emojis_found]
print(emoji_chars) # ['👍🏽', '👩‍🚀']
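
The match dictionaries above are useful beyond listing emojis: the offsets let you cut emojis out of the text, and the emoji values can be tallied. The sketch below is a rough illustration under stated assumptions. The helpers `count_emojis` and `strip_emojis` are hypothetical names, and the `matches` list is hand-written in the same shape as the output above so the snippet runs without the emoji package installed.

```python
from collections import Counter

# Hand-written stand-in for a list of match dictionaries (assumption: same
# keys as the output shown above). Emojis are escaped to avoid encoding issues.
text = "Nice \U0001F44D\U0001F3FD work \U0001F44D\U0001F3FD team \U0001F680"
matches = [
    {"match_start": 5, "match_end": 7, "emoji": "\U0001F44D\U0001F3FD"},
    {"match_start": 13, "match_end": 15, "emoji": "\U0001F44D\U0001F3FD"},
    {"match_start": 21, "match_end": 22, "emoji": "\U0001F680"},
]

def count_emojis(matches):
    """Tally how often each emoji (including multi-code-point ones) occurs."""
    return Counter(m["emoji"] for m in matches)

def strip_emojis(text, matches):
    """Rebuild the text while skipping every matched span."""
    pieces, cursor = [], 0
    for m in sorted(matches, key=lambda m: m["match_start"]):
        pieces.append(text[cursor:m["match_start"]])
        cursor = m["match_end"]
    pieces.append(text[cursor:])
    return "".join(pieces)

print(count_emojis(matches).most_common(1))  # the thumbs-up sequence, seen twice
print(repr(strip_emojis(text, matches)))     # 'Nice  work  team '
```

In real code you would pass the result of extract_emojis() instead of the hand-written list. Recent versions of the library also provide emoji.replace_emoji() for removal, which is usually the simpler choice.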

C. Checking if a Single String is an Emoji

To confirm that a string is exactly one valid emoji, use emoji.is_emoji():

print(emoji.is_emoji('❤️')) # True
print(emoji.is_emoji('👩‍💻')) # True (even though it's multiple code points)
print(emoji.is_emoji('Hello')) # False

2. The Low-Level Method: Using Unicode Ranges (Advanced/Legacy)

Most emojis are concentrated in a handful of Unicode blocks. While using a library is simpler, you can detect emojis manually by checking whether a character's code point falls within these blocks.

Caveat: This method is highly discouraged for production code because:

  1. Composite Emojis: It fails to treat sequences as a single emoji, such as skin tone modifiers (👍 + 🏿 = 👍🏿) or zero-width joiner (ZWJ) sequences (👩 + ZWJ + 💻 = 👩‍💻).
  2. Maintenance: New Unicode/Emoji releases add characters and sequences, so the ranges in your regex need manual updates [2.4].

import re

# A simplified regex pattern covering major emoji blocks
EMOJI_PATTERN_SIMPLE = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs
    "\U0001F680-\U0001F6FF"  # Transport and Map Symbols
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    # ... more ranges
    "]+", flags=re.UNICODE
)

def contains_emoji_simple(text):
    """Returns True if any character falls inside the ranges above."""
    return bool(EMOJI_PATTERN_SIMPLE.search(text))

# Test
print(contains_emoji_simple("Simple happy face 😀")) # True
# True, but only because the parts (👩 and 🚀) each match on their own;
# the ZWJ sequence is not recognized as one emoji
print(contains_emoji_simple("Complex woman astronaut 👩‍🚀"))
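
The caveats become visible as soon as you extract matches instead of only testing for presence. The following is a small sketch with a range-only pattern similar to the one above (the name `RANGE_ONLY` is chosen for this example); emojis are written as escapes so the snippet is encoding-safe.

```python
import re

# Range-only pattern: three common emoji blocks, one or more characters
RANGE_ONLY = re.compile(
    "["
    "\U0001F300-\U0001F5FF"
    "\U0001F600-\U0001F64F"
    "\U0001F680-\U0001F6FF"
    "]+"
)

astronaut = "\U0001F469\u200D\U0001F680"  # woman + ZWJ + rocket (one emoji)
two_faces = "\U0001F600\U0001F600"        # two separate grinning faces
heart = "I \u2764\uFE0F Python"           # red heart lives outside these blocks

# Over-splitting: the ZWJ is not in the ranges, so one emoji becomes two matches
print(len(RANGE_ONLY.findall(astronaut)))  # 2

# Under-splitting: adjacent emojis are glued into a single match by "+"
print(len(RANGE_ONLY.findall(two_faces)))  # 1

# Missing coverage: the heart is not detected at all
print(RANGE_ONLY.search(heart))            # None
```

So even when the presence check happens to return the right answer, counts and extracted values from a range-only pattern should not be trusted.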

3. Comparison and Use Cases

| Method | Accuracy & Robustness | Complexity | Recommended Use Case |
| --- | --- | --- | --- |
| emoji Library | High. Handles composite emojis and latest standards. | Low (easy to implement). | All modern NLP tasks, extraction, and analysis. |
| Custom Unicode/Regex | Low/Medium. Prone to missing newer or complex emojis. | High (requires manual maintenance). | Simple filtering where dependencies are strictly forbidden. |
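
If you are in the "dependencies strictly forbidden" case, a custom pattern can at least be made aware of sequences. The sketch below is one possible approach, not a complete solution: the base ranges are a deliberately small, hand-picked assumption (the U+2600 to U+27BF range also contains non-emoji symbols), and flags and keycap sequences are not handled. All names are chosen for this example.

```python
import re

# Assumption: a small, hand-picked set of base ranges, not the full emoji list
BASE = (
    "["
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F680-\U0001F6FF"  # Transport and Map Symbols
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\u2600-\u27BF"          # Miscellaneous Symbols + Dingbats (over-matches)
    "]"
)
MODIFIER = "[\U0001F3FB-\U0001F3FF]"  # skin tone modifiers
VS16 = "\uFE0F"                       # emoji presentation selector
ZWJ = "\u200D"                        # zero-width joiner

# One unit = base character plus an optional modifier or presentation selector;
# a sequence = units chained together with ZWJ
UNIT = f"{BASE}(?:{MODIFIER}|{VS16})?"
EMOJI_SEQUENCE = re.compile(f"{UNIT}(?:{ZWJ}{UNIT})*")

text = (
    "Hi \U0001F44D\U0001F3FD from "       # thumbs up + skin tone
    "\U0001F469\u200D\U0001F680 with "    # woman astronaut (ZWJ sequence)
    "\u2764\uFE0F"                        # red heart + VS16
)
found = EMOJI_SEQUENCE.findall(text)
print(len(found))  # 3
```

This keeps modifier and ZWJ sequences together, but it still inherits the maintenance burden described in the table.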

Using a specialized library like emoji ensures that your code remains accurate as the Unicode standard evolves, especially since many modern emojis are composed of multiple code points that must be treated as a single unit [1.1, 1.5].