Migrate Langfuse v2 to v3
This is a big upgrade and we tried to make it as seamless as possible. Please create a GitHub Issue or contact support in case you have any questions while upgrading to v3.
Langfuse v3 (released on Dec. 6th, 2024) introduces a new backend architecture that unlocks many new features and performance improvements.
Follow this guide to:
- Understand the architectural changes and reasoning behind them.
- Learn about the other breaking changes.
- Follow the upgrade steps to successfully migrate to Langfuse v3.
Architecture Changes
This section dives into the reasoning behind the architectural changes we made for Langfuse v3. To learn more about the architecture of Langfuse v3, jump to the architecture overview.
Langfuse has gained significant traction over the last months, both in our Cloud environment and in self-hosted setups. With Langfuse v3 we introduce changes that allow our backend to handle hundreds of events per second with higher reliability. To achieve this scale, we introduce a second Langfuse container and additional storage services like S3/Blob store, Clickhouse, and Redis which are better suited for the required workloads than our previous Postgres-based setup.
In short, Langfuse v3 adds:
- A new worker container that processes events asynchronously.
- A new S3/Blob store for storing large objects.
- A new Clickhouse instance for storing traces, observations, and scores.
- Redis/Valkey for queuing events and caching data.
Comparison of the architectures
Architecture Diagram
Langfuse consists of two application containers, storage components, and an optional LLM API/Gateway.
- Application Containers
- Langfuse Web: The main web application serving the Langfuse UI and APIs.
- Langfuse Worker: A worker that asynchronously processes events.
- Storage Components:
- Postgres: The main database for transactional workloads.
- Clickhouse: High-performance OLAP database which stores traces, observations, and scores.
- Redis/Valkey cache: A fast in-memory data structure store. Used for queue and cache operations.
- S3/Blob Store: Object storage to persist all incoming events, multi-modal inputs, and large exports.
- LLM API / Gateway: Some features depend on an external LLM API or gateway.
Langfuse can be deployed within a VPC or on-premises in high-security environments. Internet access is optional. See networking documentation for more details.
Reasoning for the architectural changes
1. Why Clickhouse
We made the strategic decision to migrate our traces, observations, and scores table from Postgres to Clickhouse. Both us and our self-hosters observed bottlenecks in Postgres when dealing with millions of rows of tracing data, both on ingestion and retrieval of information. Our core requirement was a database that could handle massive volumes of trace and event data with exceptional query speed and efficiency while also being available for free to self-hosters.
Limitations of Postgres
Initially, Postgres was an excellent choice due to its robustness, flexibility, and the extensive tooling available. As our platform grew, we encountered performance bottlenecks with complex aggregations and time-series data. The row-based storage model of PostgreSQL becomes increasingly inefficient when dealing with billions of rows of tracing data, leading to slow query times and high resource consumption.
Our requirements
- Analytical queries: all queries for our dashboards (e.g. sum of LLM tokens consumed over time)
- Table queries: Finding tracing data based on filtering and ordering selected via tables in our UI.
- Select by ID: Quickly locating a specific trace by its ID.
- High write throughput while allowing for updates. Our tracing data can be updated from the SKDs. Hence, we need an option to update rows in the database.
- Self-hosting: We needed a database that is free to use for self-hosters, avoiding dependencies on specific cloud providers.
- Low operational effort: As a small team, we focus on building features for our users. We try to keep operational efforts as low as possible.
Why Clickhouse is great
- Optimized for Analytical Queries: ClickHouse is a modern OLAP database capable of ingesting data at high rates and querying it with low latency. It handles billions of rows efficiently.
- Rich feature-set: Clickhouse offers different Table Engines, Materialized views, different types of Indices, and many integrations which helps us to build fast and achieve low latency read queries.
- Our self-hosters can use the official Clickhouse Helm Charts and Docker Images for deploying in the cloud infrastructure of their choice.
- Clickhouse Cloud: Clickhouse Cloud is a database as a SaaS service which allows us to reduce operational efforts on our side.
When talking to other companies and looking at their code bases, we learned that Clickhouse is a popular choice these days for analytical workloads. Many modern observability tools, such as Signoz or Posthog, as well as established companies like Cloudflare, use Clickhouse for their analytical workloads.
Clickhouse vs. others
We think there are many great OLAP databases out there and are sure that we could have chosen an alternative and would also succeed with it. However, here are some thoughts on alternatives:
- Druid: Unlike Druid’s modular architecture, ClickHouse provides a more straightforward, unified instance approach. Hence, it is easier for teams to manage Clickhouse in production as there are fewer moving parts. This reduces the operational burden especially for our self-hosters.
- StarRocks: We think StarRocks is great but early. The vast amount of features in Clickhouse help us to remain flexible with our requirements while benefiting from the performance of an OLAP database.
Building an adapter and support multiple databases
We explored building a multi-database adapter to support Postgres for smaller self-hosted deployments. After talking to engineers and reviewing some of PostHog’s Clickhouse implementation, we decided against this path due to its complexity and maintenance overhead. This allows us to focus our resources on building user features instead.
2. Why Redis
We added a Redis instance to serve cache and queue use-cases within our stack. With its open source license, broad native support my major cloud vendors, and ubiquity in the industry, Redis was a natural choice for us.
3. Why S3/Blob Store
Observability data for LLM application tends to contain large, semi-structured bodies of data to represent inputs and outputs. We chose S3/Blob Store as a scalable, secure, and cost-effective solution to store these large objects. It allows us to store all incoming events for further processing and acts as a native backup solution, as the full state can be restored based on the events stored there.
4. Why Worker Container
When processing observability data for LLM applications, there are many CPU-heavy operations which block the main loop in our Node.js backend, e.g. tokenization and other parsing of event bodies. To achieve high availability and low latencies across client applications, we decided to move the heavy processing into an asynchronous worker container. It accepts events from a Redis queue and ensures that they are eventually being upserted into Clickhouse.
Other Breaking Changes
If you use Langfuse SDKs above version 2.0.0 (released Dec 2023), these changes will not affect you. The Langfuse Team has already upgraded Langfuse Cloud to v3 without any issues after helping a handful of teams (less than 1% of users) to upgrade the Langfuse SDKs.
SDK Requirements
SDK v1.x.x is no longer supported. While we aim to keep our SDKs and APIs fully backwards compatible, we have to introduce backwards incompatible changes with our update to Langfuse Server v3. Certain APIs in SDK versions below version 2.0.0 are not compatible with our new backend architecture.
Release dates of SDK v2
- Langfuse Python SDK v2 was released on Dec 17, 2023,
- Langfuse JavaScript SDK v2 was released on Dec 18, 2023.
Upgrade options if you are on SDK version 1.x.x
- Default SDK upgrade: Follow the 1.x.x to 2.x.x upgrade path (Python, JavaScript). For the JavaScript SDK, consider an upgrade from 2.x.x to 3.x.x as well. The upgrade is straightforward and should not take much time.
- Optionally switch to our new integrations: Since the first major version, we built many new ways to integrate your code with Langfuse such as Decorators for Python. We would recommend to check out our quickstart to see whether there is a more convenient integration available for you.
Background of this change
Langfuse v3 relies on an event driven backend architecture. This means, that we acknowledge HTTP requests from the SDKs, queue the HTTP bodies in the backend, and process them asynchronously. This allows us to scale the backend more easily and handle more requests without overloading the database. The SDKs below 2.0.0 send the events to our server and expect a synchronous response containing the database representation of the event. If you rely on this data and access it in the code, your SDK will break as of Nov. 11th, 2024 for the cloud version and post-upgrade to Langfuse v3 when self-hosting.
API Changes
POST /api/public/ingestion
The /api/public/ingestion
endpoint is now asynchronous.
It will accept all events as they come in and queue them for processing before returning a 207 status code.
This means that events will not be available immediately after acceptance by the backend and instead will appear
eventually in subsequent read requests.
The individual events accepted a metadata
property within their body of type string | string[] | Record<string, unknown>
.
Only the Record
type is supported within our UI and endpoints to perform queries and filter events.
Therefore, we enforce an object type for metadata
going forward.
All incoming events with { metadata: string | string[] }
will have their metadata mapped to an object with key metadata
,
i.e. { event: { body: { metadata: "test" } } }
will be transformed to { event: { body: { metadata: { metadata: "test" } } } }
.
POST /api/public/scores
The /api/public/scores
endpoint is now asynchronous.
It behaves exactly as the /api/public/ingestion
endpoint, but will return a 200 status code with a body of { id: string }
type.
Before, the endpoint returned the created score object directly.
This change is inline with our API reference and therefore not considered breaking.
Deprecated endpoints
The following endpoints are deprecated since our v2 release and thereby have not been used by the Langfuse SDKs since Feb 2024. Langfuse v3 continues to accept requests to these endpoints. Their API behavior changes to be asynchronous and the endpoints will only return the id of the created object instead of the full updated record. Please note that these endpoints will be removed in a future release.
- POST /api/public/events
- POST /api/public/generations
- PATCH /api/public/generations
- POST /api/public/spans
- PATCH /api/public/spans
- POST /api/public/traces
UI Behavioral Changes
Trace Deletion
Deleting traces within the Langfuse UI was a synchronous operation and traces got removed immediately. Going forward, all traces will be scheduled for deletion, but may still be visible in the UI for a short period of time.
Project Deletion
Deleting projects within Langfuse was a synchronous operation and projects got removed immediately. Projects will be marked as deleted within Langfuse v3 and will not be accessible using the standard UI navigation options. We immediately revoke access keys for projects, but all remaining data will be removed in the background.
Information will not be deleted from the S3/Blob Store. This action needs to be performed manually by an administrator. This process will be automated in a future release.
Migration Steps
We tried to make the version upgrade as seamless as possible. If you encounter any issues please reach out to support or open an issue on GitHub.
By following this guide, you can upgrade your Langfuse v2 deployment to v3 without prolonged downtime.
Video Walkthrough
Before you start the upgrade
Before starting your upgrade, make sure you are familiar with the contents of the respective v3 deployment guide. In addition, we recommend that you perform a backup of your Postgres database before you start the upgrade. Also, ensure that you run a recent version of Langfuse, ideally a version later than v2.92.0.
For a zero-downtime upgrade, we recommend that you provision new instances of the Langfuse web and worker containers and move your traffic after validating that the new instances are working as expected.
If you go for the zero-downtime upgrade, we recommend to disable background
migrations until you shift traffic to the new instances. Otherwise, the
migration may miss events that are ingested after the new instances were
started. Set LANGFUSE_ENABLE_BACKGROUND_MIGRATIONS=false
in the environment
variables of the new Langfuse web and worker containers until traffic is
shifted. Afterwards, remove the overwrite or set to true
.
Upgrade Steps
1. Provision new infrastructure
Ensure that you deploy all required storage components (Clickhouse, Redis, S3/Blob Store) and have the connection information handy. You can reuse your existing Postgres instance for the new deployment. Ensure that you also have your Postgres connection details ready.
The following new environment variables are required for the new Langfuse web and worker containers. If you do not provide them, the deployment will fail.
CLICKHOUSE_URL
CLICKHOUSE_USER
CLICKHOUSE_PASSWORD
CLICKHOUSE_MIGRATION_URL
REDIS_CONNECTION_STRING
(or its alternatives)LANGFUSE_S3_EVENT_UPLOAD_BUCKET
2. Start new Langfuse containers.
Deploy the Langfuse web and worker containers with the settings from our self-hosting guide. Ensure that you set the environment variables for the new storage components and the Postgres connection details.
At this point, you can start to test the new Langfuse instance. The UI should load as expected, but there should not be any traces, observations, or scores. This is expected, as data is being read from Clickhouse while those elements still reside in Postgres.
3. Shift traffic from v2 to v3
Point the traffic to the new Langfuse instance by updating your DNS records or your Loadbalancer configuration. All new events will be stored in Clickhouse and should appear within the UI within a few seconds of being ingested.
4. Wait for historic data migration to complete
We have introduced background migrations as part of the migration to v3. Those allow Langfuse to schedule longer-running migrations without impacting the availability of the service. As part of the v3 release, we have introduced four migrations that will run once you deploy the new stack.
- Cost backfill: We calculate costs for all events and store them in the Postgres database. Before, those were calculated on read which had a negative impact on read performance.
- Traces migration: We migrate all traces in batches from Postgres to Clickhouse. We start with most recent traces, i.e. those should show within your dashboard soon after starting the migration.
- Observations migration: We migrate all observations in batches from Postgres to Clickhouse. We start with most recent observations, i.e. those should show within your dashboard soon after starting the migration.
- Scores migration: We migrate all scores in batches from Postgres to Clickhouse. We start with most recent scores, i.e. those should show within your dashboard soon after starting the migration.
Each migration has to finish, before the next one starts. Depending on the size of your event tables, this process may take multiple hours.
5. Stop the old Langfuse containers
After you have verified that new events are being stored in Clickhouse and are shown in the UI, you can stop the old Langfuse containers.
Support
If you experience any issues, please create an issue on GitHub or contact the maintainers (support).
For support with production deployments, the Langfuse team provides dedicated enterprise support. To learn more, reach out to enterprise@langfuse.com or schedule a demo.
Alternatively, you may consider using Langfuse Cloud, which is a fully managed version of Langfuse. You can find information about its security and privacy here.