handling large datasets

posted on January 7th, 2024

Answered

brian mulh asked on January 7, 2024

Hello,

We are working on a financial system where we need to aggregate data over a 12-month period per customer. Currently it looks like the FDS DLL will store an index in memory indefinitely. We have some concerns about scaling and storing this much data in memory per customer when using the FDS DLL. We expect the size of these datasets to be multiple GB. I have several questions related to this:
1. Is there a way for the FDS DLL to support storing/accessing the unaggregated data (query results) on disk and only pulling it into memory when needed to build a response for the API?
2. Would it be possible to add a configuration option to allow an index to be expired from the memory cache? That way, customers who rarely access their data would not require constant memory usage. When they need the data, it could be cached for n minutes and then be allowed to expire again.
3. Do you have any recommendations for handling such large datasets besides increasing our servers' memory?
4. Is there any way to eliminate the need to load unaggregated data into memory in the .NET data server by doing something like the client does with Elasticsearch? Is there a way to have the .NET server connect to elasticCache as a data source and let Elasticsearch handle the aggregations it supports?

5 answers

Solomiia Andrusiv ⋅ Flexmonster ⋅ January 9, 2024

Hello, Brian!

Thank you for reaching out to us.

As a first step, we would like you to look through our article about choosing the most suitable data source: https://www.flexmonster.com/blog/how-to-choose-the-best-data-source-to-use-with-flexmonster/. This article should guide you through all available data sources and help you determine whether the Data Server as a DLL is the best option for your case.

From our side, having gathered all the information about your use case from all your tickets, our team still thinks that the custom data source API approach would give you the most freedom. Developing a custom data source API does require more development time. However, considering all the changes to the DLL logic you want to add, both approaches could end up requiring a similar amount of time, and a custom data source API would give you more flexibility in the end.

Also, please find our comments on each of your questions below:

1. Is there a way the FDS DLL would be able to support storing/accessing the unaggregated data
Please note that, for now, our FDS DLL works with default in-memory storage. It is possible to override our storage with a custom one by implementing the IDataStorage interface and adding functionality to load the data from disk.
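
To illustrate the general idea of a storage layer that reads from disk on demand, here is a minimal, language-agnostic sketch in Python. It is not the IDataStorage interface, whose members are not shown in this thread. The class `DiskBackedStore`, its methods, the one-JSON-file-per-index layout, and the `max_in_memory` limit are all assumptions made up for this example.

```python
import json
import tempfile
from collections import OrderedDict
from pathlib import Path


class DiskBackedStore:
    """Hypothetical store: keeps at most `max_in_memory` datasets in RAM."""

    def __init__(self, directory, max_in_memory=2):
        self.directory = Path(directory)
        self.max_in_memory = max_in_memory
        # index name -> rows, ordered from least to most recently used
        self._cache = OrderedDict()

    def save(self, index_name, rows):
        # Persist the unaggregated rows as one JSON file per index (assumed layout).
        path = self.directory / f"{index_name}.json"
        path.write_text(json.dumps(rows))

    def get(self, index_name):
        if index_name in self._cache:
            self._cache.move_to_end(index_name)  # mark as most recently used
            return self._cache[index_name]
        # Not in memory: read the rows from disk only now, when requested.
        rows = json.loads((self.directory / f"{index_name}.json").read_text())
        self._cache[index_name] = rows
        if len(self._cache) > self.max_in_memory:
            self._cache.popitem(last=False)  # evict the least recently used index
        return rows

    def in_memory(self):
        return list(self._cache)


with tempfile.TemporaryDirectory() as tmp:
    store = DiskBackedStore(tmp, max_in_memory=2)
    for name in ("customer_a", "customer_b", "customer_c"):
        store.save(name, [{"customer": name, "amount": 100}])
        store.get(name)
    # Only the two most recently used indexes remain in memory.
    resident = store.in_memory()
```

With a limit of two, requesting a third index pushes the least recently used one out of memory, while its file stays on disk and can be read again later. A real implementation would also need to handle concurrent requests and a more compact on-disk format than JSON.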

2. Would it be possible to add a configuration option to allow an index to be expired from the memory cache?
We recommend disabling indexes that have not been used for a while. This way, they would be deleted from the in-memory storage. Implementing this approach would require adding custom code that deletes the index. Please let us know if you would like more details about this approach.
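
As a rough illustration of what such custom expiry logic could look like, the Python sketch below drops any index that has not been accessed for a configurable idle period and reloads it on the next request. It does not use the FDS DLL API. `ExpiringIndexCache`, `load_index`, and the 600-second idle period are hypothetical names and values chosen for this example.

```python
import time


class ExpiringIndexCache:
    """Hypothetical cache that forgets indexes idle for `idle_seconds` or more."""

    def __init__(self, loader, idle_seconds, clock=time.monotonic):
        self._loader = loader            # callable that builds an index by name
        self._idle_seconds = idle_seconds
        self._clock = clock              # injectable so the example needs no sleeping
        self._entries = {}               # index name -> (data, last access time)

    def get(self, index_name):
        self.evict_idle()
        now = self._clock()
        if index_name in self._entries:
            data, _ = self._entries[index_name]
        else:
            data = self._loader(index_name)  # reload after expiry or on first use
        self._entries[index_name] = (data, now)
        return data

    def evict_idle(self):
        now = self._clock()
        stale = [name for name, (_, last) in self._entries.items()
                 if now - last >= self._idle_seconds]
        for name in stale:
            del self._entries[name]
        return stale

    def loaded(self):
        return sorted(self._entries)


fake_now = [0.0]   # simulated clock, in seconds
loads = []         # records every time an index is (re)built


def load_index(name):
    loads.append(name)
    return {"index": name}


cache = ExpiringIndexCache(load_index, idle_seconds=600, clock=lambda: fake_now[0])
cache.get("customer_a")        # loaded at t=0
fake_now[0] = 300
cache.get("customer_b")        # loaded at t=300
fake_now[0] = 700
evicted = cache.evict_idle()   # customer_a has been idle for 700 s, customer_b for 400 s
```

In this sketch eviction runs on every `get` call. A server would more likely run `evict_idle` from a background timer so that memory is released even when no requests arrive.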

3. Do you have any recommendations for handling such large datasets besides increasing our servers' memory?
Kindly note that you have already followed most of our basic recommendations, like clearing the cached data and focusing on aggregated data instead of raw records. We can also suggest reducing the data index size by filtering out unnecessary rows/columns, if that is possible.
Operating big data sources requires either a lot of memory to store the data on your server or more time to fetch the data if it is located remotely. At the same time, implementing the custom data source API protocol would give you more control over how data processing and storage are organized.

4. Is there a way to have the .NET server connect to elasticCache as a data source and let Elasticsearch handle the aggregations it supports?
To be on the same page, could you please let us know if you have tried to integrate Flexmonster with Elasticsearch out of the box? Please let us know about your experience with this integration and if there are any further questions about it.

Hope you will find our answer helpful.
Feel free to reach out to us in case of any further questions.

Kind regards,
Solomiia

Solomiia Andrusiv ⋅ Flexmonster ⋅ January 17, 2024

Hello, Brian!

Hope you are doing well.

Our team is wondering if you have had time to check our previous message. Could you please let us know if it was helpful in choosing the most suitable data source?

Looking forward to hearing from you.

Kind regards,
Solomiia

brian mulh ⋅ January 18, 2024

Solomiia,

I tested a couple of options and haven't yet found something that supports the features we need at the scale of data we have. Elasticsearch seemed promising for the amount of data we have but does not support all the features we need according to this. I also tested removing indexes that aren't used often, but we feel that is a short-term solution rather than a long-term one. Since we are a small team with limited resources, we are trying to stay away from building our own custom API.

Solomiia Andrusiv ⋅ Flexmonster ⋅ January 19, 2024

Hello, Brian!

Thank you for your feedback.

We kindly recommend checking out the MongoDB data source. Our MongoDB Connector was created as an implementation of the custom data source API, so it has all the features available for the custom data source API, and it doesn't store indexes in RAM. Although MongoDB does not match Elasticsearch's performance, we think it could be a compromise between performance and all the functionality required for the case you described.
You can read more about the MongoDB Connector at this link: https://www.flexmonster.com/doc/mongodb-connector/

Please share your feedback on the MongoDB data source.

Looking forward to hearing from you.

Kind regards,
Solomiia

Solomiia Andrusiv ⋅ Flexmonster ⋅ January 31, 2024

Hello, Brian!

Hope you are doing well.

Just checking in to ask if the MongoDB data source seems to be a possible solution for your use case.

Looking forward to hearing from you.

Kind regards,
Solomiia
