Media Intelligence Platform: Multi-Source Content Aggregation for PR Analytics
Enterprise media intelligence platform aggregating content from YouTube, Twitter/X, Reddit, and Malaysian news sources with full-text search, trend analysis, and intelligent caching.
Background
A leading Malaysian public relations agency approached HighFlyer with a complex challenge: their PR consultants were spending countless hours manually tracking media coverage and public discourse across multiple platforms for their corporate clients. With content fragmented across YouTube, Twitter/X, Reddit, and numerous Malaysian news websites, gaining a comprehensive view of brand perception required laborious manual effort.
The agency needed a unified platform that could aggregate content from all these sources, provide powerful search capabilities, and visualise trends over time. This would enable their consultants to deliver data-driven insights to clients efficiently.
The Challenge
Building an enterprise-grade media intelligence platform presented several significant technical challenges:
- Data Fragmentation: Content was scattered across four major platforms (YouTube, Twitter/X, Reddit) plus six Malaysian news websites, each with different data structures, APIs, and access patterns.
- Anti-Scraping Measures: Major platforms employ sophisticated anti-bot measures. Twitter/X in particular had deprecated reliable API access, requiring alternative approaches. Several news sites utilise Cloudflare protection.
- Near Real-Time Requirements: PR professionals needed access to recent content with historical context. They required not hours-old cached data but fresh results that could inform time-sensitive media responses.
- Complex Query Requirements: Users needed to search using multiple terms simultaneously, filter by platform and date range, and view aggregated trend data across all sources.
- Performance at Scale: With millions of indexed documents, the system needed to return comprehensive search results in under 3 seconds while handling concurrent users.
The Solution by HighFlyer
We architected a three-tier solution comprising a Python-based data aggregation service, an OpenSearch indexing layer, and a Next.js web application. Each component was optimised for its specific role in the data pipeline.
Multi-Platform Data Aggregation Service
At the heart of the platform is a FastAPI-powered backend service that orchestrates data collection from all sources. We implemented platform-specific adapters to handle each source’s unique requirements:
YouTube Integration: We leveraged the official YouTube Data API for reliable access to video metadata, channel information, and descriptions. JMESPath expressions transform the nested JSON responses into a consistent schema.
Twitter/X Automation: With Twitter’s API becoming increasingly restrictive, we implemented a Playwright-based browser automation solution. The system maintains authenticated sessions with state persistence, intercepts GraphQL responses from Twitter’s internal SearchTimeline endpoint, and extracts structured tweet data including engagement metrics. The headless browser uses the --headless=new flag for compatibility with modern Chrome behaviour.
Reddit Integration: Using AsyncPRAW (Async Python Reddit API Wrapper), we query Reddit’s OAuth API for posts across all subreddits, capturing titles, authors, comment counts, and upvote scores.
News Site Crawlers: For news sites, we built Scrapy spiders with specialised parsing strategies for each source:
- SSR-based sites: Extract server-rendered data from Next.js __NEXT_DATA__ blocks
- Structured data sites: Parse JSON-LD (application/ld+json) NewsArticle schema markup
- Cloudflare-protected sites: Integrate with FlareSolverr to bypass browser challenges, caching bypass credentials with expiry management
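To illustrate the structured-data strategy, here is a minimal sketch that pulls a NewsArticle object out of a page's JSON-LD markup using only the Python standard library. It is not the project's spider code: the class and function names are hypothetical, and a production Scrapy spider would more likely use selectors on the response object.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the parsed contents of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than failing the whole page


def extract_news_article(html):
    """Return the first JSON-LD object whose @type is NewsArticle, or None."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    for block in parser.blocks:
        # A JSON-LD script may hold a single object or a list of objects.
        candidates = block if isinstance(block, list) else [block]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == "NewsArticle":
                return item
    return None
```

The same pattern would apply to SSR-based sites by matching the script tag whose id is __NEXT_DATA__ instead of matching on the type attribute.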
All adapters run concurrently using asyncio.gather(), dramatically reducing total collection time while maintaining fault tolerance. If one platform fails, others continue unaffected.
Search Infrastructure
The OpenSearch cluster serves as the platform’s central nervous system, providing both storage and search capabilities:
Index Architecture: We maintain separate indices for each content type (youtube-results, twitter-results, reddit-results, posts) with custom mappings optimised for each data structure. Date fields use proper timestamp types for efficient range queries.
Intelligent Caching: Rather than hitting external APIs on every search, we implemented a 6-hour cache strategy. Each indexed document carries a searchedAt timestamp and searchedQuery field. When a user searches, the system first checks if matching cached results exist within the TTL window. This reduces API costs by approximately 80% while ensuring data freshness.
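The cache check described above might be sketched as follows. This is an assumption-laden illustration rather than the production code: it assumes searchedQuery is mapped as a keyword field and searchedAt as a date field, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(hours=6)  # the 6-hour window described in the text


def build_cache_lookup(query, now=None):
    """Build a search body matching documents cached for `query` within the TTL."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - CACHE_TTL
    return {
        "query": {
            "bool": {
                "filter": [
                    # Assumes searchedQuery is a keyword field (exact match).
                    {"term": {"searchedQuery": query}},
                    {"range": {"searchedAt": {"gte": cutoff.isoformat()}}},
                ]
            }
        }
    }


def is_fresh(searched_at, now=None):
    """Return True if a cached document is still inside the TTL window."""
    now = now or datetime.now(timezone.utc)
    return now - searched_at <= CACHE_TTL
```

If the lookup returns no hits, the service would fall through to the external platforms and index the new results with fresh searchedAt values.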
Advanced Querying: Search queries use boolean queries with boosted fields (titles weighted 2x), the AND operator for multi-term precision, and field collapsing to deduplicate results by source ID.
Aggregations: Every search returns not just matching documents but also date histogram aggregations, bucketing results by day to power trend visualisation in the frontend.
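Taken together, the querying and aggregation features could be expressed in a single search body along these lines. The field names (title, description, publishedAt, sourceId) and the function name are assumptions for illustration; the actual index mappings are not given in this case study.

```python
def build_search_body(terms, start_date, end_date, page=0, page_size=20):
    """Build a search body with boosted titles, AND logic, collapsing and a daily histogram."""
    return {
        "from": page * page_size,
        "size": page_size,
        "query": {
            "bool": {
                # One clause per search term; every clause must match (AND logic).
                "must": [
                    {
                        "multi_match": {
                            "query": term,
                            "fields": ["title^2", "description"],  # titles weighted 2x
                            "operator": "and",
                        }
                    }
                    for term in terms
                ],
                "filter": [
                    {"range": {"publishedAt": {"gte": start_date, "lte": end_date}}}
                ],
            }
        },
        # Keep one hit per source item; the collapse field must be a keyword or numeric field.
        "collapse": {"field": "sourceId"},
        "aggs": {
            "posts_per_day": {
                "date_histogram": {"field": "publishedAt", "calendar_interval": "day"}
            }
        },
    }
```

The posts_per_day buckets in the response are what a frontend chart would plot as post volume over time.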
Analytics Web Application
The Next.js application provides an intuitive interface for PR professionals:
Multi-Query Search Builder: Users construct searches using a tag-based input, adding multiple search terms that are combined with AND logic. Platform checkboxes allow filtering to specific sources.
Tabbed Results View: Results are organised by platform, with each tab displaying:
- A date histogram chart (Recharts) showing post volume over time
- Paginated results with keyword highlighting
- Direct links to source content
- Engagement metrics (likes, upvotes, retweets, comments)
Saved Queries: Frequently-used query combinations can be saved and re-executed with a single click, streamlining daily monitoring workflows.
Resilient Data Fetching: The frontend uses Promise.allSettled() for API calls, ensuring partial results display even if one platform’s data is unavailable.
Technical Deep-Dive
Handling Anti-Scraping Measures
One of the most technically challenging aspects was maintaining reliable access to platforms that actively discourage automated access:
Playwright Session Management: For Twitter/X, we save authentication state to a JSON file after initial login, allowing subsequent requests to skip the login flow. The browser intercepts network requests to capture GraphQL responses, parsing the complex nested structure with dual JMESPath expressions to handle different response formats (with and without promoted content).
Cloudflare Bypass: Several news sites employ Cloudflare’s browser challenge. We integrate FlareSolverr, a proxy service that solves Cloudflare challenges using a real browser, and then cache the resulting cookies and User-Agent headers. Subsequent requests reuse these credentials until expiry, minimising bypass overhead.
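The credential reuse described here can be sketched as a small per-domain cache with expiry. Everything in this example is hypothetical: the solve callable stands in for whatever call is made to the challenge-solving proxy, and the 30-minute default TTL is an assumed value, not one taken from the project.

```python
import time
from dataclasses import dataclass


@dataclass
class BypassCredentials:
    cookies: dict
    user_agent: str
    expires_at: float


class CredentialCache:
    """Caches bypass credentials per domain and re-solves only after expiry."""

    def __init__(self, solve, ttl_seconds=1800.0, clock=time.time):
        self._solve = solve      # hypothetical callable: domain -> (cookies, user_agent)
        self._ttl = ttl_seconds  # assumed lifetime of the solved credentials
        self._clock = clock      # injectable so expiry can be tested
        self._by_domain = {}

    def get(self, domain):
        cached = self._by_domain.get(domain)
        if cached is not None and cached.expires_at > self._clock():
            return cached  # still valid: skip the expensive challenge solve
        cookies, user_agent = self._solve(domain)
        fresh = BypassCredentials(cookies, user_agent, self._clock() + self._ttl)
        self._by_domain[domain] = fresh
        return fresh
```

A crawler would call get() before each request and attach the returned cookies and User-Agent, so the browser-based solve runs only once per domain per TTL window.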
Async Architecture for Performance
Performance was critical given the real-time nature of PR work:
Backend Parallelism: The FastAPI service uses Python’s asyncio.gather() with return_exceptions=True to run all platform scrapers simultaneously. A typical multi-platform search completes in 2-3 seconds rather than the 10+ seconds sequential execution would require.
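As a rough sketch of this pattern, the example below runs three stand-in adapters concurrently and separates successes from failures. The adapter functions are hypothetical placeholders (one deliberately fails), not the project's actual scrapers.

```python
import asyncio


async def search_youtube(query):
    await asyncio.sleep(0.01)  # stands in for network latency
    return [{"platform": "youtube", "title": f"{query} video"}]


async def search_twitter(query):
    await asyncio.sleep(0.01)
    raise RuntimeError("session expired")  # simulate one platform failing


async def search_reddit(query):
    await asyncio.sleep(0.01)
    return [{"platform": "reddit", "title": f"{query} post"}]


async def search_all(query):
    """Run every adapter concurrently; one failure does not cancel the others."""
    adapters = {
        "youtube": search_youtube,
        "twitter": search_twitter,
        "reddit": search_reddit,
    }
    outcomes = await asyncio.gather(
        *(adapter(query) for adapter in adapters.values()),
        return_exceptions=True,  # exceptions are returned in place, not raised
    )
    results, errors = {}, {}
    for name, outcome in zip(adapters, outcomes):
        if isinstance(outcome, BaseException):
            errors[name] = str(outcome)
        else:
            results[name] = outcome
    return results, errors
```

Because the adapters overlap in time, total latency approaches that of the slowest adapter rather than the sum of all of them.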
Frontend Resilience: The Next.js API routes use Promise.allSettled() rather than Promise.all(), ensuring that a timeout on one platform doesn’t block results from others. Users see available results immediately while slow sources continue loading.
Bulk Indexing: New documents are indexed using OpenSearch’s bulk API, batching hundreds of documents in single requests rather than individual insertions.
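The bulk API accepts newline-delimited JSON in which each document is preceded by an action line. A minimal sketch of building such a request body follows; the function name and the id field are assumptions, and in practice a client library's bulk helper would usually handle this.

```python
import json


def build_bulk_body(index, documents, id_field="id"):
    """Serialise documents into the newline-delimited format used by the bulk API."""
    lines = []
    for doc in documents:
        # Action line: index this document under a stable ID so re-runs overwrite it.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc[id_field]}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    if not lines:
        return ""
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"
```

Using the source platform's own identifier as the document ID is one way to keep repeated searches from creating duplicate entries.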
Data Normalisation Pipeline
With data coming from disparate sources in vastly different formats, normalisation was essential:
JMESPath Transformations: We use JMESPath expressions to declaratively transform complex nested JSON into flat, consistent schemas. This approach is more maintainable than procedural parsing code and easier to update when source formats change.
Timestamp Unification: Different platforms return timestamps in different formats: Unix milliseconds, ISO 8601 strings, and custom date formats. The normalisation layer converts all timestamps to a consistent format before indexing.
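A timestamp normaliser along these lines could handle the three input shapes mentioned. This is a sketch under stated assumptions: the custom format string is an invented example, the target format (ISO 8601 in UTC) is a guess at what "consistent format" means here, and naive timestamps are assumed to be UTC, which may not hold for sites publishing in Malaysian local time.

```python
from datetime import datetime, timezone


def to_utc_iso(value, custom_format="%d %b %Y, %I:%M %p"):
    """Convert Unix milliseconds, ISO 8601 strings, or a custom format to ISO 8601 UTC."""
    if isinstance(value, (int, float)):
        # Numeric input is treated as Unix time in milliseconds.
        dt = datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    else:
        text = value.strip()
        if text.endswith("Z"):
            # Older Python versions do not accept a trailing "Z" in fromisoformat.
            text = text[:-1] + "+00:00"
        try:
            dt = datetime.fromisoformat(text)
        except ValueError:
            dt = datetime.strptime(text, custom_format)
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return dt.astimezone(timezone.utc).isoformat()
```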
Field Aliasing: Platform-specific field names (e.g., favorite_count vs upvote vs likes) are mapped to semantic equivalents, enabling cross-platform queries and consistent frontend rendering.
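Field aliasing of this kind can be as simple as a per-platform lookup table. In the sketch below, the source field names come from the examples above, but their assignment to specific platforms and the unified name "likes" are assumptions made for illustration.

```python
# Per-platform mapping from source field names to a shared semantic name.
# Which platform uses which field name is an assumption here.
FIELD_ALIASES = {
    "twitter": {"favorite_count": "likes"},
    "reddit": {"upvote": "likes"},
    "youtube": {"likes": "likes"},
}


def normalise_fields(platform, raw):
    """Rename platform-specific keys to shared names, leaving other keys unchanged."""
    aliases = FIELD_ALIASES.get(platform, {})
    return {aliases.get(key, key): value for key, value in raw.items()}
```

With every platform exposing the same key, a single sort or range filter on likes can span all indices.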
The Results
The completed platform demonstrated significant capabilities:
- Dramatic Time Savings: What previously took hours of manual checking across multiple platforms could now be accomplished in seconds with a single search.
- Comprehensive Coverage: The platform successfully aggregated content from four major platforms and six news sources into a unified search interface.
- Historical Trend Analysis: Date histogram visualisations revealed patterns over time, providing the foundation for measuring campaign effectiveness and identifying emerging issues.
- Actionable Insights: The combination of search precision, engagement metrics, and trend data provided the tools needed for data-driven recommendations.
- Scalable Architecture: The system was designed to handle growing data volumes and concurrent users without performance degradation.
Technology Stack
Backend
- Python with FastAPI for high-performance async API
- Playwright for browser automation
- Scrapy for web crawling
- AsyncPRAW for Reddit API access
- JMESPath for data transformation
Search
- OpenSearch / Elasticsearch for full-text search and aggregations
Frontend
- Next.js with App Router architecture
- React with TypeScript
- Tailwind CSS for styling
- Recharts for data visualisation
Infrastructure
- Docker and Docker Compose for containerisation
- FlareSolverr for Cloudflare bypass
Conclusion
This project exemplifies HighFlyer’s ability to architect and deliver complex, enterprise-grade solutions that solve real business problems. By combining deep expertise in search infrastructure, data pipeline architecture, browser automation, and modern web development, we built a platform capable of transforming how PR professionals track media coverage.
The technical challenges, from bypassing anti-scraping measures to building a performant search layer over heterogeneous data sources, required creative problem-solving and engineering excellence. The result is a robust, scalable architecture that demonstrates what’s possible when modern technologies are thoughtfully combined.
Interested in building a custom data platform? Contact HighFlyer to discuss how we can help transform your data into actionable intelligence.
Project Details
Client: Malaysian PR Agency
Industry: Public Relations & Media
Key Metrics:
- Platforms Integrated: 4
- News Sources Crawled: 6
- Search Response Time: <3s
- System Uptime: 99.9%
Achievements:
- Built scalable multi-source data aggregation pipeline handling thousands of daily queries
- Implemented intelligent anti-blocking bypass mechanisms for reliable data collection
- Delivered sub-3-second search across millions of indexed documents
- Created intuitive analytics dashboard with real-time trend visualisation