
A New Foundation for Network Quality Analytics
In 5G projects, it’s easy to focus on what’s visible: radio metrics, coverage maps, reports, and service quality indicators. In practice, however, whether an analytics platform can “carry” future growth is determined by purely non-functional parameters: scalability, fault tolerance, environment reproducibility, and observability. If these elements are not designed and verified, growth in data volume and user numbers will quickly expose architectural limits—usually at the least desirable moment.
In this article, we present the results of work carried out under project FEMA.01.01-IP.01-02E5/24-00, titled “Optimization and improvement of a platform for analysis and quality assessment of services in next-generation 5G mobile networks”, implemented since January 2025 (planned completion: April 2027), co-financed by the European Union (EU contribution: PLN 2,583,589.29).
The goal of the project is to develop—through research—innovative functionalities and modules for the RFBENCHMARK CrowdSource platform, including a new microservices architecture, new methods for map-based visualization of results, binary HEX packet decoders, and tools that automate software development and deployment.
Task 1 within this project (Containerization of microservices along with transforming stand-alone solutions to cloud) was planned and conducted as industrial research into the architecture and operations of data platforms. Its purpose was not to “deploy a new technology,” but to develop and experimentally validate a new runtime model for the RFBENCHMARK CrowdSource platform, one that enables a transition from a stand-alone deployment to a cloud-native microservices architecture. This included full monitoring, deployment automation (Infrastructure as Code), and an approach to data migration that does not impact the client layer.
The key point was to ensure that the solution works not only on paper, but demonstrates real functionality through a set of test scenarios and metrics that answer specific research questions.
Questions and hypotheses
We adopted a work model based on verifiable hypotheses. The most important questions posed at the beginning were:
Is it possible to develop a runtime architecture in which the platform is horizontally scalable and resilient to failures of individual nodes, while maintaining service continuity and without forcing changes on the client applications?
Can environments be built in a repeatable and auditable way (IaC), so that deployment and recovery are a process—not a manual operation?
Finally: can we prepare a real data migration scheme from a NoSQL model to a relational one, preserving API-layer compatibility (Parse Server) and data quality control?
To answer these questions credibly, we needed to build an environment, simulate load and failure conditions, collect metrics, and compare system behavior in operationally relevant scenarios.
From legacy diagnosis to experiments on the target architecture
The research started with an analysis of the baseline state. This is a stage that many projects cut short, which later causes unpleasant surprises, sometimes only after deployment and to the frustration of end users.
Our many years of experience in building innovative telecommunications solutions has taught us that in R&D projects, baseline analysis is essential: without measurement and identification of critical points, it is impossible to define the next stages well.
In the diagnostic phase, we analyzed logs and system behavior under load, with particular emphasis on data-layer performance and operational resilience. We defined test scenarios including, among others: mass write batches, analytical queries (including geospatial), mixed read/write load profiles, and an I/O stress scenario related to parallel export/backup and data processing.
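To make these profiles concrete, below is a minimal sketch of how a mixed read/write profile can be driven against the platform’s REST layer (a Parse-compatible endpoint). The base URL, class name, and credentials are illustrative assumptions, not the project’s actual test harness.

    # Minimal sketch of a mixed read/write load profile against a Parse-compatible
    # REST endpoint. The base URL, class name, and credentials are placeholders.
    import random
    import threading
    import requests

    BASE = "https://platform.example.com/parse"            # hypothetical endpoint
    HEADERS = {
        "X-Parse-Application-Id": "APP_ID",                # placeholder credentials
        "Content-Type": "application/json",
    }

    def write_batch(n=50):
        # Mass write scenario: insert n synthetic measurement objects.
        for _ in range(n):
            requests.post(f"{BASE}/classes/Measurement", headers=HEADERS,
                          json={"rsrp": random.randint(-120, -70), "tech": "5G"},
                          timeout=10)

    def geo_reads(n=50):
        # Analytical read scenario: geospatial query for objects near a point.
        where = ('{"location":{"$nearSphere":'
                 '{"__type":"GeoPoint","latitude":52.23,"longitude":21.01}}}')
        for _ in range(n):
            requests.get(f"{BASE}/classes/Measurement", headers=HEADERS,
                         params={"where": where, "limit": 100}, timeout=10)

    # Mixed profile: writers and readers run concurrently while latency
    # percentiles and 5xx rates are observed on the server side.
    threads = ([threading.Thread(target=write_batch) for _ in range(4)]
               + [threading.Thread(target=geo_reads) for _ in range(4)])
    for t in threads:
        t.start()
    for t in threads:
        t.join()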
The experiments produced measurable observations that directly justified architectural decisions. For example, we identified situations where export/backup operations and write operations competed for disk resources, leading to increased latency (p95 exceeded 1.2 s). We also observed an increased frequency of 5xx errors correlated with peaks in CPU and I/O load. These observations do not in themselves indicate application problems; they signal that the environment needs standardized observability and an architecture capable of distributing load, reducing single points of failure (SPOFs), and reconverging quickly after failures.
In the next step, we designed the target architecture: a distributed microservices platform with a runtime layer based on container orchestration (Docker Swarm), a standard traffic routing and termination gateway (reverse proxy), a consistent telemetry model (Prometheus + Alertmanager + Grafana), and dedicated secrets management (HashiCorp Vault). This phase was constructive and research-driven: the objective was not to “pick a popular stack,” but to build an environment that enables experiments and answers questions about availability, reproducibility, and service stability.
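As an illustration of how such a runtime layer can be inspected programmatically, the following sketch uses the Docker SDK for Python to enumerate Swarm services and their replica counts. It assumes access to a manager node; the expected components named in the comment reflect the architecture described above, not a specific deployment manifest.

    # Sketch: enumerate Swarm services and replica counts with the Docker SDK
    # for Python (docker-py). Assumes it runs on a manager node with access to
    # the Docker socket.
    import docker

    client = docker.from_env()
    for service in client.services.list():
        spec = service.attrs["Spec"]
        mode = spec.get("Mode", {})
        replicas = mode.get("Replicated", {}).get("Replicas", "global")
        print(f"{spec['Name']}: replicas={replicas}")
    # In this architecture the listing is expected to include the reverse proxy,
    # the application services, and the telemetry stack (Prometheus, Alertmanager,
    # Grafana), each running as a Swarm service.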
Cluster, quorum, and behavior under failure conditions
One of the key elements of Task 1 was validating cluster behavior in failure scenarios, because this is where expectations and reality most often diverge. In the target environment, we applied manager/worker node roles and the quorum mechanism characteristic of Swarm orchestration (Raft consensus among manager nodes). We then prepared failure and recovery scenarios.
The research confirmed that when simulating a restart of one of the manager nodes, the remaining nodes regained quorum in under 10 seconds, without stopping services and without the need for manual recovery of cluster state. Importantly, the monitoring system continued collecting data locally, and metrics federation resumed automatically after about 30 seconds. This had practical significance: it showed that the platform does not depend on a single control point operationally, and that reconvergence mechanisms work within a time acceptable for maintenance.
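A drill of this kind can be scripted. The sketch below is an assumption about tooling rather than the project’s actual test harness: it checks manager reachability and the Raft quorum threshold from a surviving manager node using the Docker SDK for Python.

    # Sketch: check manager reachability and quorum health during a failure
    # drill, executed on a surviving manager node.
    import docker

    client = docker.from_env()
    managers = client.nodes.list(filters={"role": "manager"})
    reachable = [
        n for n in managers
        if n.attrs.get("ManagerStatus", {}).get("Reachability") == "reachable"
    ]
    quorum = len(managers) // 2 + 1
    print(f"{len(reachable)}/{len(managers)} managers reachable, quorum needs {quorum}")
    if len(reachable) < quorum:
        print("Quorum lost: the cluster cannot accept control-plane changes")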
Monitoring as an architectural component—not an “extra chart”
We assumed that monitoring cannot be “added at the end.” Therefore, part of the research work was devoted to designing a telemetry model that answers real operational questions: where queues build up, whether API performance degrades in correlation with I/O, how services behave over time, whether database replication stays within an acceptable window, and whether there are symptoms of impending failures (e.g., disk space, number of connections, CPU load, disk latency).
The telemetry layer was organized into three levels: host metrics, container metrics, and data/service-layer metrics (e.g., Postgres Exporter, application metrics). This enabled not only “data collection,” but also event correlation: comparing API latency (p50/p95/p99), load, and database metrics with the behavior of individual microservices.
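For illustration, telemetry at these levels can be correlated through the Prometheus HTTP API. In the sketch below, the Prometheus address and metric names (an API latency histogram and a node_exporter disk metric) are assumptions about the instrumentation, not an excerpt from the project’s dashboards.

    # Sketch: correlate API latency with host I/O pressure via the Prometheus
    # HTTP API. Address and metric names are illustrative assumptions.
    import requests

    PROM = "http://prometheus:9090"   # hypothetical in-cluster address

    def instant_query(expr):
        r = requests.get(f"{PROM}/api/v1/query", params={"query": expr}, timeout=10)
        r.raise_for_status()
        return r.json()["data"]["result"]

    # p95 API latency over the last 5 minutes (assumed histogram metric name).
    p95 = instant_query(
        'histogram_quantile(0.95, '
        'sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
    )
    # Host-level disk busy time from node_exporter, for correlation with latency.
    io_busy = instant_query('rate(node_disk_io_time_seconds_total[5m])')
    print(p95, io_busy)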
Preparing the migration approach
Task 1 also covered data aspects. In platforms collecting measurement data, the challenge is not simply changing the database engine, but maintaining semantic consistency and compatibility with the API layer—especially when there are mobile apps and integrations that cannot be “switched over” overnight.
As part of the work, we prepared a migration scheme from MongoDB to PostgreSQL, while maintaining a mediation layer based on Parse Server. We identified the data class structure and the dominant share of measurement data, then designed an approach in which a new Parse Server instance can operate on a relational data model in a way that is as transparent to the client as possible. This is the area where the research relies most heavily on compatibility experiments, CRUD tests, consistency verification, and assessment of performance impact under typical load profiles.
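The sketch below shows the general shape of such a transfer for a single measurement class. Connection strings, collection, table, and column names are hypothetical, and a real migration additionally maps Parse-specific metadata (objectId, createdAt, ACLs) so that Parse Server can keep serving the same API.

    # Minimal sketch of a MongoDB -> PostgreSQL transfer for one measurement
    # class. Names and connection strings are hypothetical; the production
    # scheme also carries over Parse metadata (objectId, createdAt, ACL).
    import psycopg2
    from pymongo import MongoClient

    mongo = MongoClient("mongodb://mongo:27017")["rfbenchmark"]     # assumed names
    pg = psycopg2.connect("dbname=rfbenchmark host=postgres user=parse")
    cur = pg.cursor()
    SQL = ("INSERT INTO measurement (object_id, rsrp, tech) VALUES (%s, %s, %s) "
           "ON CONFLICT (object_id) DO NOTHING")   # idempotent on re-runs

    def flush(rows):
        # Write a batch to PostgreSQL and commit.
        if rows:
            cur.executemany(SQL, rows)
            pg.commit()
            rows.clear()

    batch = []
    for doc in mongo["Measurement"].find({}, batch_size=1000):
        batch.append((str(doc["_id"]), doc.get("rsrp"), doc.get("tech")))
        if len(batch) >= 1000:
            flush(batch)
    flush(batch)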
IaC and “reproducibility” as a measurable outcome
Another axis of Task 1 was developing the environment in the Infrastructure as Code paradigm. In practice, this means the environment can be built and restored from versioned definitions, and the operator does not configure the platform manually but oversees an automated process. In industrial research, IaC becomes a tool for ensuring experiment repeatability: if the environment is reproducible, test results can be compared without the risk that “the configuration is different.”
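One possible shape of that automation is sketched below: the stack is converged from a versioned definition and then checked for configuration drift. The stack name, file layout, and the digest-pinning convention are assumptions used for illustration.

    # Sketch: rebuild an environment from versioned definitions and detect drift.
    # Repository layout and stack name are assumptions; the point is that the
    # deployment is driven entirely by files under version control.
    import subprocess
    import docker

    STACK = "rfbenchmark"                      # hypothetical stack name
    COMPOSE_FILE = "deploy/stack.yml"          # versioned definition in the repo

    # 1. Deploy (or converge) the stack from the versioned definition.
    subprocess.run(
        ["docker", "stack", "deploy", "--compose-file", COMPOSE_FILE, STACK],
        check=True,
    )

    # 2. Drift check: stack services are prefixed with the stack name, and each
    #    should reference an image pinned by digest.
    client = docker.from_env()
    for service in client.services.list(filters={"name": STACK}):
        image = service.attrs["Spec"]["TaskTemplate"]["ContainerSpec"]["Image"]
        if "@sha256:" not in image:
            print(f"warning: {service.name} image is not pinned by digest: {image}")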
Results of validation experiments: stack stability and service behavior
After building the environment and instrumenting monitoring, we executed a set of validation experiments. The research confirmed stable operation of the application stack over a 7-day observation period, with no container restarts. The health endpoint returned correct status across all environments, Parse Dashboard remained available, and CRUD operations as well as measurement data processing could be performed in parallel. Additionally, from an observability perspective, no CPU or memory anomalies were recorded in monitoring system logs, indicating correct configuration and stable operating parameters.
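A simplified version of such a validation run is sketched below: a health probe followed by a CRUD round trip through the Parse Server REST API. The server URL, keys, and test class are illustrative.

    # Sketch: post-deployment smoke test covering the health endpoint and a
    # CRUD round trip through the Parse Server REST API. URL, mount path,
    # credentials, and the class name are placeholders.
    import requests

    BASE = "https://platform.example.com/parse"
    HEADERS = {"X-Parse-Application-Id": "APP_ID", "Content-Type": "application/json"}

    # Health endpoint exposed by Parse Server under its mount path.
    assert requests.get(f"{BASE}/health", timeout=10).status_code == 200

    # Create -> read -> delete round trip on a test class.
    obj = requests.post(f"{BASE}/classes/SmokeTest", headers=HEADERS,
                        json={"probe": True}, timeout=10).json()
    object_id = obj["objectId"]
    read = requests.get(f"{BASE}/classes/SmokeTest/{object_id}",
                        headers=HEADERS, timeout=10)
    assert read.status_code == 200 and read.json()["probe"] is True
    delete = requests.delete(f"{BASE}/classes/SmokeTest/{object_id}",
                             headers=HEADERS, timeout=10)
    assert delete.status_code == 200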
On the data layer, we confirmed PostgreSQL replication in logical mode, and failover tests showed no loss of write operations. This is an important result because it demonstrates not only “dry-run functionality,” but system behavior in conditions that determine real service quality: node failure, cluster reconvergence, write continuity, and telemetry consistency.
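One way to keep the replication window under observation is to query pg_stat_replication on the publisher. The sketch below assumes a dedicated monitoring role and a 60-second alert threshold; both are illustrative choices.

    # Sketch: verify that logical replication stays within an acceptable window
    # by inspecting pg_stat_replication on the primary. Connection parameters
    # and the 60 s threshold are assumptions.
    import psycopg2

    conn = psycopg2.connect("dbname=rfbenchmark host=postgres-primary user=monitor")
    with conn.cursor() as cur:
        cur.execute("""
            SELECT application_name,
                   state,
                   COALESCE(EXTRACT(EPOCH FROM replay_lag), 0) AS replay_lag_s
            FROM pg_stat_replication
        """)
        for name, state, lag in cur.fetchall():
            status = "OK" if state == "streaming" and lag < 60 else "ALERT"
            print(f"{status} subscriber={name} state={state} replay_lag={lag:.1f}s")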
What was delivered—and what does it mean for 5G?
Task 1 is not a “new feature” visible on a user’s screen. It is the foundation that—by reducing risk—enables further development. The research resulted in developing and validating an architecture that:
distributes services and eliminates critical SPOFs,
provides observability that allows diagnosing issues before they become user incidents,
enables environment reproducibility and change control through IaC, and
creates a realistic base for data migration and further development work.
In the context of 5G, this matters for one reason: the faster data scale grows and the more dynamic the changes become, the more the non-functional qualities of the architecture become a prerequisite for maintaining service quality. Task 1 was therefore the stage in which we verified, through experiments, metrics, and failure scenarios, how to prepare the architecture of a monitoring platform to meet that challenge.
Next step
Closing Task 1 means readiness for work whose effects will be more “visible” to platform users—while avoiding the risk of building on an unstable foundation. In the next stage, we will develop and improve methods of visualizing results on an online interactive map. This is where practical value is decided—at the intersection of data and presentation: whether results are clear, comparable, filterable, and interpretable in real scenarios (devices, locations, network usage profiles). From the platform user’s perspective, the key outcome of the next task is that network and mobile device quality results can be not only collected, but also understood and used operationally—directly from the map and visual layer.