(Log Intelligence & Observability) - 5+ years of experience
- We are building a sophisticated service designed to transform raw log data into actionable answers.
- The system takes user input, fetches relevant logs from sources like Splunk, and processes them for delivery.
- As part of this flow, you will help architect intermediate caching layers and data processing pipelines to ensure fast, reliable access to distributed log data.
- While Splunk is our primary focus today, our vision is a log-agnostic platform capable of extracting and aggregating data from a variety of sources to provide a unified troubleshooting experience.
You will be responsible for the end-to-end flow of log retrieval and processing, ensuring that our users get the data they need with minimal latency.
- Pipeline Development: Build and optimize the “Input-to-Answer” workflow: take user input, execute Splunk queries, cache outputs, and deliver final results.
- API & Integration: Develop RESTful Flask APIs and integrate with the Splunk API across five distributed computing log layers.
- Cross-Platform Expansion: Architect the system to eventually support log extraction from other sources, including ELK (Elasticsearch/Logstash/Kibana), Graylog, and more.
- Caching & Optimization: Implement intermediate caching layers, potentially using alternative log-processing solutions to speed up data delivery.
- Security & Validation: Manage service-to-service authentication (JWT/Shared Secrets) and implement query validation and workload analysis.
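The "Input-to-Answer" flow above (user input, Splunk query, intermediate cache, final result) can be sketched roughly as follows. This is an illustrative outline only: `run_splunk_search` is a hypothetical stand-in for a real Splunk SDK call, and the query construction and TTL cache are simplified assumptions, not the service's actual design.

```python
import hashlib
import time

CACHE_TTL_SECONDS = 300
_cache = {}  # intermediate caching layer: query hash -> (timestamp, events)

def run_splunk_search(spl):
    # Hypothetical placeholder for a splunklib-based search execution;
    # returns fake events here so the sketch is self-contained.
    return [{"_raw": f"event matching: {spl}"}]

def cache_key(spl):
    # Stable key for the query string.
    return hashlib.sha256(spl.encode()).hexdigest()

def answer(user_input):
    # Naive query construction from user input (illustrative only).
    spl = f'search index=main "{user_input}"'
    key = cache_key(spl)
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]  # serve cached output, skipping the Splunk round-trip
    events = run_splunk_search(spl)
    _cache[key] = (time.time(), events)  # cache output for later requests
    return events
```

In the real service this handler would sit behind a Flask endpoint and the dictionary cache would be replaced by a shared store, but the shape of the flow is the same.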
Required Qualifications (Must-Have)
- Python & Flask: Proficiency in Python 3.9+ and the Flask framework for microservices.
- Advanced Splunk Querying: Strong experience with Splunk Query Language (SPL) and the Splunk SDK for Python is essential. You should be comfortable writing complex searches to extract specific insights from high-volume data.
- Kubernetes Knowledge: Deep understanding of Kubernetes logs is required to effectively validate and interpret the content being analyzed.
- Troubleshooting: Proven ability to troubleshoot distributed systems using logs. You should understand how to trace an issue across multiple services and nodes.
- Testing: Strong commitment to quality, including hands-on experience writing integration and unit tests with Pytest.
- Containerization: Experience with Docker and Kubernetes/Helm for deployment.
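To illustrate the kind of Kubernetes log interpretation and Pytest-style testing listed above, here is a minimal sketch: a parser for CRI-format container log lines (`<timestamp> <stream> <flag> <message>`, as written by the kubelet) plus a unit test for it. The helper name and test are hypothetical examples, not part of the existing codebase.

```python
def parse_k8s_log_line(line):
    """Split a CRI-style container log line into its four fields:
    '<timestamp> <stream> <flag> <message>'."""
    timestamp, stream, flag, message = line.split(" ", 3)
    return {"timestamp": timestamp, "stream": stream, "flag": flag, "message": message}

def test_parse_k8s_log_line():
    # Pytest discovers test_* functions and runs bare asserts like these.
    line = "2024-05-01T12:00:00.000Z stderr F connection refused"
    parsed = parse_k8s_log_line(line)
    assert parsed["stream"] == "stderr"
    assert parsed["message"] == "connection refused"
```

Running `pytest` against a file containing this code would collect and execute the test automatically.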
Preferred Qualifications (Nice-to-Have)
- Distributed Systems: Hands-on experience with Spark or Flink. Since our log-fetching mechanisms are part of a distributed architecture, this experience is highly valuable for understanding the data lifecycle.
- Multi-Stack Experience: Familiarity with ELK, Graylog, ClickHouse, or OpenSearch, as we plan to integrate these into our extraction engine.
- Advanced Observability: Experience with platform engineering or advanced log management (index management, retention policies).
- Specialized Tooling: Familiarity with internal tools such as Whisper, Mosaic, or Rio.
Tech Stack: Python, Splunk, Kubernetes, Flask API, Integration Testing