
Media Intelligence Platform: Multi-Source Content Aggregation for PR Analytics

Enterprise media intelligence platform aggregating content from video platforms, social media, forums, and regional news sources with full-text search, trend analysis, and intelligent caching.

Background

A leading Malaysian public relations agency approached HighFlyer with a complex challenge: their PR consultants were spending countless hours manually tracking media coverage and public discourse across multiple platforms for their corporate clients. With content fragmented across video platforms, social media, forums, and numerous regional news websites, gaining a comprehensive view of brand perception required laborious manual effort.

The agency needed a unified platform that could aggregate content from all these sources, provide powerful search capabilities, and visualise trends over time. This would enable their consultants to deliver data-driven insights to clients efficiently.

The Challenge

Building an enterprise-grade media intelligence platform presented several significant technical challenges:

Data Fragmentation: Content was scattered across multiple platforms—video hosting, social media, forums—plus six regional news websites, each with different data structures, APIs, and access patterns.
Data Access Complexity: Major platforms have varying API availability and access patterns, requiring platform-specific integration strategies.
Near Real-Time Requirements: PR professionals needed recent content alongside historical context. Stale results were not acceptable, because the data had to be fresh enough to inform time-sensitive media responses.
Complex Query Requirements: Users needed to search using multiple terms simultaneously, filter by platform and date range, and view aggregated trend data across all sources.
Performance at Scale: With millions of indexed documents, the system needed to return comprehensive search results in under 3 seconds while handling concurrent users.

The Solution by HighFlyer

We architected a three-tier solution comprising a Python-based data aggregation service, an OpenSearch indexing layer, and a Next.js web application. Each component was optimised for its specific role in the data pipeline.

System Architecture

Multi-Platform Data Aggregation Service

At the heart of the platform is a FastAPI-powered backend service that orchestrates data collection from all sources. We implemented platform-specific adapters to handle each source’s unique requirements:

Video Platform Integration: We leveraged official APIs for reliable access to video metadata, channel information, and descriptions. JMESPath expressions transform the nested JSON responses into a consistent schema.

Social Media Integration: Platform-appropriate data connectors retrieve posts and engagement metrics, extracting structured data including likes, shares, and comment counts.

Forum Integration: Using OAuth-based API access, we query forum platforms for discussion threads and community content, capturing titles, authors, comment counts, and engagement scores.

News Site Crawlers: For news sites, we built web crawlers with specialised parsing strategies for each source:

SSR-based sites: Extract server-rendered data from framework data blocks
Structured data sites: Parse LD+JSON NewsArticle schema markup
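
As an illustration of the structured-data strategy, the sketch below pulls NewsArticle objects out of a page's LD+JSON script blocks. It uses only the Python standard library rather than the project's actual crawler code, and the class name, function name, and sample HTML are hypothetical.

```python
import json
from html.parser import HTMLParser


class LdJsonExtractor(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_ld_json = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ld_json = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_ld_json:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld_json:
            self.blocks.append("".join(self._buffer))
            self._in_ld_json = False


def extract_news_articles(html):
    """Return every JSON-LD object in the page whose @type is NewsArticle."""
    parser = LdJsonExtractor()
    parser.feed(html)
    articles = []
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the whole page
        # A block may hold a single object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == "NewsArticle":
                articles.append(item)
    return articles


# Hypothetical page fragment used only to exercise the function.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "NewsArticle", "headline": "Example headline", "datePublished": "2024-01-15T08:00:00+08:00"}
</script>
</head><body></body></html>
"""

articles = extract_news_articles(SAMPLE_HTML)
print(articles[0]["headline"])
```

Real pages sometimes nest articles under an @graph key or give @type as a list, so a production parser would likely need a few more cases than this sketch covers.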

All adapters run concurrently using asyncio.gather(), dramatically reducing total collection time while maintaining fault tolerance. If one platform fails, others continue unaffected.

[Figure: Data Flow Diagram]

Search Infrastructure

The OpenSearch cluster serves as the platform’s central nervous system, providing both storage and search capabilities:

Index Architecture: We maintain separate indices for each content type (video-results, social-results, forum-results, posts) with custom mappings optimised for each data structure. Date fields use proper timestamp types for efficient range queries.

Intelligent Caching: Rather than hitting external APIs on every search, we implemented a 6-hour cache strategy. Each indexed document carries a searchedAt timestamp and a searchedQuery field. When a user searches, the system first checks whether matching cached results exist within the TTL window, and only calls the external APIs when they do not. This reduces API costs by approximately 80% while keeping cached results no more than six hours old.
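
A minimal sketch of the cache check follows, assuming searchedQuery is mapped as a keyword field and searchedAt as a date field. Only those two field names come from the description above; the function name and the way the cutoff is computed are illustrative assumptions, and the function only builds the search body rather than calling a cluster.

```python
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(hours=6)  # the 6-hour window described above


def build_cache_lookup(query, now=None):
    """Build a search body matching documents cached for `query` within the TTL."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - CACHE_TTL
    return {
        "query": {
            "bool": {
                # Filters skip relevance scoring; this is a yes/no freshness check.
                "filter": [
                    {"term": {"searchedQuery": query}},
                    {"range": {"searchedAt": {"gte": cutoff.isoformat()}}},
                ]
            }
        }
    }


fixed_now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
lookup = build_cache_lookup("acme merger", now=fixed_now)
print(lookup["query"]["bool"]["filter"][1])
```

If the lookup returns hits, the cached documents can be served directly; if it returns none, the service would call the external sources and index the fresh results with a new searchedAt value.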

Advanced Querying: Search queries are built as boolean queries with boosted fields (titles weighted 2x), the AND operator for multi-term precision, and field collapsing to deduplicate results by source ID.

Aggregations: Every search returns not just matching documents but also date histogram aggregations, bucketing results by day to power trend visualisation in the frontend.
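
The query features and the daily aggregation described above might combine into a single search body along the following lines. This is a sketch under assumptions: the field names title, content, publishedAt, and sourceId are hypothetical, as is the function itself. The title boost, the AND operator, the collapse clause, and the daily date histogram mirror the description.

```python
def build_search_body(terms, start, end, page=0, size=20):
    """Build a search body with boosted matching, deduplication, and a daily histogram."""
    return {
        "from": page * size,
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": " ".join(terms),
                            # "^2" weights title matches twice as heavily.
                            "fields": ["title^2", "content"],
                            # Every term must match, not just any one of them.
                            "operator": "and",
                        }
                    }
                ],
                "filter": [
                    {"range": {"publishedAt": {"gte": start, "lte": end}}}
                ],
            }
        },
        # Keep one hit per source ID (the field must be keyword or numeric).
        "collapse": {"field": "sourceId"},
        "aggs": {
            "posts_per_day": {
                "date_histogram": {
                    "field": "publishedAt",
                    "calendar_interval": "day",
                }
            }
        },
    }


body = build_search_body(["acme", "merger"], "2024-01-01", "2024-01-31")
print(body["aggs"]["posts_per_day"])
```

One point worth noting with this shape: aggregations are computed over all matching documents, not over the collapsed hits, so the histogram counts may exceed the number of deduplicated results shown.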

Analytics Web Application

The Next.js application provides an intuitive interface for PR professionals:

Multi-Query Search Builder: Users construct searches using a tag-based input, adding multiple search terms that are combined with AND logic. Platform checkboxes allow filtering to specific sources.

Tabbed Results View: Results are organised by platform, with each tab displaying:

A date histogram chart (Recharts) showing post volume over time
Paginated results with keyword highlighting
Direct links to source content
Engagement metrics (likes, upvotes, retweets, comments)

Saved Queries: Frequently-used query combinations can be saved and re-executed with a single click, streamlining daily monitoring workflows.

Resilient Data Fetching: The frontend uses Promise.allSettled() for API calls, ensuring partial results display even if one platform’s data is unavailable.

Technical Deep-Dive

Async Architecture for Performance

Performance was critical given the real-time nature of PR work:

Backend Parallelism: The FastAPI service uses Python’s asyncio.gather() with return_exceptions=True to run all data connectors simultaneously. A typical multi-platform search completes in 2-3 seconds rather than the 10+ seconds sequential execution would require.
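
The pattern can be sketched as follows. The two connector coroutines are stand-ins invented for this example (one succeeds, one fails), and the collect helper is hypothetical; only the use of asyncio.gather() with return_exceptions=True comes from the description above.

```python
import asyncio


async def fetch_video(query):
    await asyncio.sleep(0.01)  # stands in for a network call
    return [{"platform": "video", "title": f"{query} clip"}]


async def fetch_forum(query):
    await asyncio.sleep(0.01)
    raise TimeoutError("forum API timed out")  # simulated failure


async def collect(query, connectors):
    """Run every connector concurrently; keep successes and record failures."""
    names = list(connectors)
    outcomes = await asyncio.gather(
        *(connectors[name](query) for name in names),
        return_exceptions=True,  # exceptions are returned as values, not raised
    )
    results, errors = {}, {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            errors[name] = str(outcome)
        else:
            results[name] = outcome
    return results, errors


results, errors = asyncio.run(
    collect("acme", {"video": fetch_video, "forum": fetch_forum})
)
print(results)
print(errors)
```

Because gather() returns outcomes in the same order as its arguments, zipping them with the connector names is enough to tell which source failed. Total wall-clock time is roughly that of the slowest connector rather than the sum of all of them.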

Frontend Resilience: The Next.js API routes use Promise.allSettled() rather than Promise.all(), ensuring that a timeout on one platform doesn’t block results from others. Users see available results immediately while slow sources continue loading.

Bulk Indexing: New documents are indexed using OpenSearch’s bulk API, batching hundreds of documents in single requests rather than individual insertions.
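
For reference, the bulk API takes a newline-delimited body in which each document is preceded by an action line. The helper below is a hypothetical sketch that builds such a body with the standard library; the id field name and the sample documents are assumptions, and sending the payload to the cluster is left out.

```python
import json


def build_bulk_payload(index, docs, id_field="id"):
    """Serialise documents into the newline-delimited body the bulk API expects."""
    lines = []
    for doc in docs:
        # Action line: index this document under a stable ID so re-runs overwrite.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc[id_field]}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # Every line, including the last, must be terminated by a newline.
    return "".join(line + "\n" for line in lines)


payload = build_bulk_payload(
    "forum-results",
    [{"id": "a1", "title": "First thread"}, {"id": "a2", "title": "Second thread"}],
)
print(payload)
```

Using the source ID as the document ID means that indexing the same item twice updates it in place instead of creating a duplicate.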

Data Normalisation Pipeline

With data coming from disparate sources in vastly different formats, normalisation was essential:

JMESPath Transformations: We use JMESPath expressions to declaratively transform complex nested JSON into flat, consistent schemas. This approach is more maintainable than procedural parsing code and easier to update when source formats change.

Timestamp Unification: Different platforms return timestamps in different formats: Unix milliseconds, ISO 8601 strings, and custom date formats. The normalisation layer converts all timestamps to a consistent format before indexing.
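
A sketch of such a conversion, using only the standard library, is shown below. The target format (ISO 8601 in UTC), the example custom format string, and the choice to treat timezone-naive values as UTC are all assumptions made for illustration; the source does not state which format the platform settled on.

```python
from datetime import datetime, timezone


def to_utc_iso(value, custom_format="%d %b %Y %H:%M"):
    """Normalise Unix milliseconds, ISO 8601, or a custom format to ISO 8601 UTC."""
    if isinstance(value, (int, float)):
        # Unix epoch in milliseconds.
        parsed = datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    else:
        text = value.strip()
        # fromisoformat() rejects a trailing "Z" before Python 3.11.
        iso_text = text[:-1] + "+00:00" if text.endswith("Z") else text
        try:
            parsed = datetime.fromisoformat(iso_text)
        except ValueError:
            parsed = datetime.strptime(text, custom_format)
        if parsed.tzinfo is None:
            # Assumption: values without an offset are already in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


print(to_utc_iso(1705305600000))
print(to_utc_iso("2024-01-15T16:00:00+08:00"))
print(to_utc_iso("15 Jan 2024 08:00"))
```

All three calls describe the same instant, so they print the same string. For sources that publish local times without an offset, the naive-value assumption would need to be replaced with the source's actual timezone.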

Field Aliasing: Platform-specific field names (e.g., favorite_count vs upvote vs likes) are mapped to semantic equivalents, enabling cross-platform queries and consistent frontend rendering.
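
In its simplest form, aliasing is a lookup table applied to each raw record. The sketch below uses the three field names mentioned above; the shared name "likes" and the helper function are hypothetical choices for this example.

```python
# Platform-specific engagement fields mapped to one shared name.
FIELD_ALIASES = {
    "favorite_count": "likes",
    "upvote": "likes",
    "likes": "likes",
}


def apply_aliases(raw, aliases=FIELD_ALIASES):
    """Rename platform-specific keys to their shared names; pass other keys through."""
    return {aliases.get(key, key): value for key, value in raw.items()}


print(apply_aliases({"favorite_count": 12, "title": "Launch recap"}))
print(apply_aliases({"upvote": 340, "title": "Discussion thread"}))
```

With every record exposing the same key, a single sort or range filter on that field works across all platforms, and the frontend can render engagement counts without platform-specific branches.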

The Results

The completed platform demonstrated significant capabilities:

Dramatic Time Savings: What previously took hours of manual checking across multiple platforms could now be accomplished in seconds with a single search.
Comprehensive Coverage: The platform successfully aggregated content from four major platforms and six news sources into a unified search interface.
Historical Trend Analysis: Date histogram visualisations revealed patterns over time, providing the foundation for measuring campaign effectiveness and identifying emerging issues.
Actionable Insights: The combination of search precision, engagement metrics, and trend data provided the tools needed for data-driven recommendations.
Scalable Architecture: The system was designed to handle growing data volumes and concurrent users without performance degradation.

Technology Stack

Backend

Python with FastAPI for high-performance async API
Scrapy for web crawling
JMESPath for data transformation

Search

OpenSearch / Elasticsearch for full-text search and aggregations

Frontend

Next.js with App Router architecture
React with TypeScript
Tailwind CSS for styling
Recharts for data visualisation

Infrastructure

Docker and Docker Compose for containerisation

Conclusion

This project exemplifies HighFlyer’s ability to architect and deliver complex, enterprise-grade solutions that solve real business problems. By combining deep expertise in search infrastructure, data pipeline architecture, browser automation, and modern web development, we built a platform capable of transforming how PR professionals track media coverage.

The technical challenges—from integrating multiple data sources with varying APIs to building a performant search layer over heterogeneous data—required creative problem-solving and engineering excellence. The result is a robust, scalable architecture that demonstrates what’s possible when modern technologies are thoughtfully combined.

Interested in building a custom data platform? Contact HighFlyer to discuss how we can help transform your data into actionable intelligence.

Project Details

Client: Malaysian PR Agency

Industry: Public Relations & Media

Key Metrics:

Platforms Integrated: 4
News Sources Crawled: 6
Search Response Time: <3s
System Uptime: 99.9%

Achievements:

Built scalable multi-source data aggregation pipeline handling thousands of daily queries
Implemented intelligent caching and rate limiting for reliable data collection
Delivered sub-3-second search across millions of indexed documents
Created intuitive analytics dashboard with real-time trend visualisation
