AWS re:Invent, the largest cloud computing conference of 2017, is here, and it's taking over Las Vegas in a big way. An estimated 40,000 engineers, product leads, marketers, technical architects, and expert users from around the world will descend on The Strip, attending keynotes, bootcamps, demos, hackathons, and in-depth hands-on training sessions at the Aria, Venetian, Mirage, MGM Grand, and other venues from November 27 to December 1.
Actian is a partner of AWS re:Invent 2017 and everyone is invited to visit us in the Expo Hall of The Venetian at Booth #1538 (as you enter the Hall from the front, we are in a center column of booths near the AWS Village at the far end). We’ll be at the booth (and the Welcome Reception) on these dates and times:
Tuesday, November 28: 10:30am – 3:00pm and 5:00pm – 7:00pm (Welcome Reception)
Wednesday, November 29: 10:30am – 6:00pm
Thursday, November 30: 10:30am – 6:00pm
If you’re new to Actian products, here are some of the products in our portfolio we’ll be happy to talk to you about:
The Actian Vector in-memory analytics database has been a consistent performance leader on the TPC-H decision support benchmark over the last five years. We'll also be thrilled to discuss Actian Vector – Community Edition, just launched on the AWS Marketplace.
Actian X provides a single data management platform for OLTP and analytics.
Actian NoSQL accelerates Agile development for complex object models at enterprise scale.
Actian DataConnect provides lightweight, enterprise-class hybrid data integration.
We’ll be liveblogging our experience at the show with our new Instagram account that you can follow here (or visit @actiancorp when you get the chance). You’ll get to see AWS re:Invent from a unique perspective and learn a bit about Actian along the way. If you happen to visit and post on Instagram (or other social media), please be sure to tag us with #ActianCorp!
Not sure what to do or expect to see at AWS re:Invent when you’re not visiting the Actian booth? You can check out the Campus page to get info about what to see at each venue and how to get between venues (either through the shuttle bus or walking). An overview of the day-by-day Agenda is available here. You can also learn about all of the Keynotes, Bootcamps, Sessions, or just have some fun at the Tatonka Challenge or the Robocar Rally.
Along with the aforementioned Instagram account, you can follow us on Twitter and on LinkedIn to stay connected with what we are up to. If you fancy a job to pursue your passion in data management, data integration, and data analytics, check out our careers page and come join our team – WE’RE HIRING!
We hope you have a fantastic time at AWS re:Invent and we look forward to meeting all of you in person to learn more about Actian’s products, community and customers.
Here we go again: another term. We are still debating the real meaning of the Cloud and whether a database fits in IaaS, PaaS, SaaS or a combination of all three, and now we are experiencing another wave of change.
To be fair, the term Fog Computing has been around for quite some time, but it has never enjoyed the popularity of its buddy “the Cloud”.
Is Fog Computing another pointless renaming of well-known technologies and IT infrastructures? Or is it a real new thing that Systems and Database Administrators should look at, study and embrace? My intent here is not to support one side or the other, but simply to offer a few thoughts that may come in handy in understanding where the IT market is going, and more specifically where MySQL can be a good fit in some of the new trends in technology.
IoT: Internet of Things? No! IT + OT
When we talk about IoT in general, everyone agrees that it is already changing the world we live in. Furthermore, analysts predict trillions of dollars in business for IoT, and clearly all the big high-tech companies want a large slice of the pie. Things become interesting, though, when we ask analysts, decision makers and engineers what IoT is, or even better, what an implementation of IoT looks like. The thought goes immediately to our wearables, smart phones, or devices at home: smart fridges and smart kettles are embarrassing examples of something that looks like the new seasonal fashion trend. These devices are certainly a significant part of IoT, and they make ordinary people aware of IoT, but they are not what developers and administrators should [only] look at. The multi-trillion-dollar business predicted by analysts is a mix of smart devices that can connect cities and rural areas, homes and large buildings, offices and manufacturing plants, mines, farms, trains, ships, cars… but also goods and even animals and human beings. All these connected elements have one thing in common: they generate a massive amount of data. This data must be collected, stored, validated, moved, analyzed… and this is not a trivial job.
Many refer to IoT as the Internet of Things, and to IIoT as the Industrial Internet of Things, i.e. the part of IoT that is related to industrial processes. Industrial processes add more complexity to the equation: the environment is sometimes inhospitable, intermittently accessible and unattended by operators and users (or there are literally no users). All this may also be true for non-IIoT environments; the difference is that if your Fitbit runs out of power you may be disappointed, but if a sensor on an oil platform or an actuator on a train does not have power, that is a much bigger deal.
To me, IoT is clearly all of the above, with IIoT being a subset of IoT. Personally, I have a somewhat different approach to IoT. For almost my entire working life I have been involved in the IT (i.e. Information Technology) side of the business, recently with databases, but previously designing and building CRM and ERP products and solutions. In my mind, IoT means IT meets OT (i.e. Operational Technology), and the two technologies cannot be treated separately: they are tightly related, and any product in IoT has both an IT and an OT aspect to consider. It also means that OT is no longer relegated to the industrial and manufacturing world of PLC and SCADA systems, and is now widely adopted in any environment and at any level, even in what we wear or implant.
The convergence of IT and OT into IoT makes IT physical, something that has been missing in many IT solutions. We, as IT people dealing with data, tend to manage data in an abstract manner. When we consider something physical, we refer to the performance we can squeeze from the hardware where our databases reside. With OT, we need to think broadly of the physical world where the data comes from or goes to and, even more importantly, of the journey of that data through every bit of the IT and OT infrastructure.
It’s a Database! No, It’s a Router! No, It’s Both!
The journey! That is the key point. MySQLers think about data as data stored in a database. When we think about the movement of data, we refer to it in terms of data extraction or data loading. The fact is, in IoT, data has value and must be considered both when it is stored somewhere (data at rest) and when it is moving from one place to another (data in motion). Moving data in IoT terms means data streaming. There is a plethora of solutions for streaming, like Kafka, RabbitMQ and many other *MQ products, but their main focus is to store and forward data, not to use it while it is in motion. The problem is, infrastructures are so complicated, with multiple layers and too many cases where data stops while in motion, that it becomes a priority to analyse and “use” the data even while it is transiting from one component to another.
This is a call to build the next-generation database, optimised for IoT, with features that go beyond the ability to store and analyse data. Data streaming and the analysis of streamed data must be part of a modern database, as also highlighted by a recent Gartner report. If you are a Database Administrator, you may consider it a database with all the features of a traditional database, but with routing and streaming capabilities. If you are a Network and Systems Administrator, you may consider it a router or a streaming system with database capabilities. One way or another, the database needed for IoT must incorporate the features of a traditional database and those of a traditional router. Furthermore, it must take into consideration all the security aspects of data moved and stored multiple times and, even more importantly, it must provide safe data attestation (but let's reserve this aspect for another post).
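As a minimal illustration of this "database plus router" idea, here is a hedged Python sketch that consumes sensor readings from a stream, evaluates them while they are in motion, and then persists them to MySQL. The topic name, table, threshold field and connection details are assumptions, and the kafka-python and mysql-connector-python libraries simply stand in for whatever streaming layer a real deployment would use.

```python
# Sketch only: analyse data in motion, then store it at rest.
# Assumes a local Kafka broker, a "sensor-readings" topic, and a MySQL
# table readings(device_id VARCHAR, value DOUBLE, ts TIMESTAMP).
import json

from kafka import KafkaConsumer          # pip install kafka-python
import mysql.connector                   # pip install mysql-connector-python

consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
db = mysql.connector.connect(user="fog", password="secret", database="edge")
cursor = db.cursor()

for message in consumer:
    reading = message.value
    # "Use" the data while it is still in motion: flag anomalies immediately.
    if reading["value"] > reading.get("threshold", 100):
        print(f"ALERT: device {reading['device_id']} out of range")
    # Then store it at rest for later consolidation toward the Cloud.
    cursor.execute(
        "INSERT INTO readings (device_id, value, ts) VALUES (%s, %s, NOW())",
        (reading["device_id"], reading["value"]),
    )
    db.commit()
```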
Welcome to Fog Computing
So, here it is: Fog Computing is all of the above. Take three layers:
The Edge: where things, animals and human beings live, where data is collected, and where results from analysis go.
The Cloud: where a massive amount of data is stored and analysed, where systems are reliable and scalable, and are attended by operators and administrators.
The Fog: everything in between. It is, in oriental terms, “where the ground meets the sky”. The Fog is the layer that is still close to the Edge, but must provide features that are typically associated with the Cloud. It is also the layer where data collected from a vast number of things is consolidated and sent to the Cloud whenever possible.
The term Fog Computing is so vague that for some analysts it refers to everything from the Edge of sensors and devices to regional concentrators, routers and gateways. For other analysts, Fog Computing refers only to the layer above the Edge, i.e. to the gateways and routers. Personally, I like to think that the former, i.e. Edge + middle layer, offers a more practical definition of Fog Computing.
In Fog Computing, we bring the capabilities of Cloud computing into a more complex, constrained and often technically inhospitable environment. We must collect and store a large amount of data on constrained devices the size of a wristwatch, where the processing power is mostly used to operate the system and data management is a secondary concern. Although the power of an Edge system is increasing exponentially, we no longer have the luxury of a stable, always-on environment. It is a bit like going back 20 years or more, to when we started using personal computers to manage data. It is a fascinating challenge, certainly unwelcome to lazy administrators, but one that brings excitement to experienced developers.
Where Is MySQL in All This?
Here is the catch: Fog Computing desperately needs databases. It needs products that can handle data at rest and in motion on constrained devices, with a small footprint; databases that maximise the use of hardware resources, are reliable, and can be installed in many flavours so they are almost 100% available when needed. Many NoSQL solutions are good in theory (because of the way they manage unstructured data), but they are often too resource-hungry to compete in this environment, or they lack features that MySQL implemented more than a decade ago. Embedded databases sit at the other end of the spectrum, but their features are often limited, making those solutions rather incomplete.
Sound familiar? Edge and Fog Computing are the perfect place for MySQL, or at least for solutions based on MySQL to which more features are added. At the moment there are no real database and data management products for Fog Computing. The current solutions are mostly based on MySQL, but they are built ad hoc and their implementations are not replicable: a situation that slows the growth of this market and makes the overall cost of a solution higher than it should be.
The opportunity is huge, but also challenging. The first implementation does not have to be a brand-new product; it can be something achievable, built step by step. As for more examples and real, live projects, watch this space!
If you’re an IT professional, software engineer, or software product manager, over the past few years you’ve likely considered using modern data platforms such as Apache Hadoop; NoSQL databases like MongoDB, Cassandra, and Kudu; search databases like Solr and Elasticsearch; in-memory systems like Spark and MemSQL; and cloud data stores such as Amazon Redshift, Google BigQuery, and Snowflake. But are these modern data technologies here to stay, or are they a flash in the pan, with the traditional relational database still reigning supreme?
In the Spring of 2017, Zoomdata commissioned O’Reilly Media to create and execute a survey assessing the state of the data and analytics industry. The focus was on understanding the penetration of modern big and streaming data technologies, how data analytics are being consumed by users, and what skills organizations are most interested in staffing. Nearly 900 people from a diverse set of industries, as well as government and academia, responded to the survey. Below is a preview of some of the insights provided by the survey.
Modern data platforms have eclipsed relational databases as a main data source
Of course, relational databases continue to be the core of online transactional processing (OLTP) systems. However, one of the most interesting findings was that when asked about their organization’s main data sources, less than one-third of survey respondents listed the relational database, with around two-thirds selecting non-relational sources. This is a clear indication that these non-relational data platforms have firmly crossed the chasm from early adopters into mainstream use.
Of further interest is the fact that just over 40% of respondents indicated their organizations are using what could be categorized as “modern data sources” such as Hadoop, in-memory, NoSQL, and search databases as a main data source. These modern data sources are optimized to handle what is often referred to as the “three V’s” of big data: very large data volumes; high velocity streaming data; and high variety of unstructured and semi-structured data, such as text and log files.
Drilling further into the details, analytic databases (19%) and Hadoop (14%) were the two most popular non-relational sources. Analytic databases are a category of SQL-based data stores such as Teradata, Vertica, and MemSQL that typically make use of column-store and/or massively parallel processing (MPP) to greatly speed up the kinds of large aggregate queries used when analyzing data. Hadoop, as many readers know, is a software framework used for distributed storage and processing of very large structured and unstructured data sets on computer clusters built from commodity hardware.
Summary: Which is more important, the data or the algorithms? This chicken and egg question led me to realize that it’s the data, and specifically the way we store and process the data that has dominated data science over the last 10 years. And it all leads back to Hadoop.
Recently I was challenged to speak on the role of data in data science. This almost sounds like a chicken and egg problem. How can you have one without the other? But as I reflected on how to explain this it also struck me that almost everything in the press today is about advances in algorithms. That’s mostly deep learning and reinforcement learning which are driving our chatbots, image apps, and self-driving cars.
So if you are fairly new to data science, say within the last five or six years, you may have missed the fact that it is, and was, the data, or more specifically how we store and process the data, that was the single most important factor in the explosion of data science over the last decade. In fact, there was a single innovation that enabled data lakes, recommenders, IoT, natural language processing, image and video recognition, AI, and reinforcement learning.
Essentially all of these areas of major innovation can be traced back to a single enabler: Hadoop and NoSQL.
It was in 2006 that Doug Cutting and his team took the proprietary work done at Google to the Apache Software Foundation and created open source Hadoop.
Most of you will recognize that this was also the birth of the era of Big Data, because Hadoop for the first time gave us a reasonable way to store, retrieve, and analyze anything. The addition of unstructured and semi-structured data like text, speech, image, and video created the possibilities of AI that we have today. It also let us store volumes of ordinary data like web logs or big transactional files that were previously simply too messy to store.
What you may not know, and what I heard Doug Cutting himself say at last spring’s Strata Conference in San Jose, is that the addition of unstructured and semi-structured data is not the most important feature of Hadoop. The most important feature is that it allowed many ordinary computers to function as a single computer. This was the birth of massively parallel processing (MPP). If it hadn’t been for MPP, the hardware we have today would never have evolved, and today’s data science simply would not and could not exist.
It’s interesting to track the impact that this has had on each of the major data science innovations over the last decade:
Predictive Analytics
I have personally been practicing predictive analytics since 2001. As valuable as that discipline was becoming to any major company with a large B2C market, we were restricted to basically numerical data.
As we move through this history I’ll use this graphic to help locate the impact of the ‘data’ versus the innovation it enables. On the vertical axis we have the domains of structured through unstructured data. On the horizontal axis, a description of whether that data science technique delivers very specific insights or just more directional guidance.
For the most part, in predictive modeling we were restricted to what we could extract from RDBMS systems like a BI warehouse, or with much more effort from transactional systems. A few of our algorithms like decision trees could directly handle standardized alpha fields like state abbreviations, but pretty much everything had to be converted to numeric.
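To make the "everything had to be converted to numeric" point concrete, here is a small hedged sketch using one-hot encoding in pandas; the column names and values are invented for illustration.

```python
# Minimal sketch: turning a categorical field (state abbreviation) into
# numeric columns so a typical predictive model can consume it.
import pandas as pd

customers = pd.DataFrame({
    "state": ["CA", "TX", "CA", "NY"],              # categorical / alpha field
    "annual_spend": [1200.0, 845.5, 300.0, 990.0],  # already numeric
})

# One-hot encode the state column; the result is purely numeric.
features = pd.get_dummies(customers, columns=["state"])
print(features.head())
```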
Predictive models, on the other hand, deliver business insights that are extremely specific about consumer behavior or the future value of a target variable. Generally, predictive models continue to deliver predictions in the range of 70% to 90% accuracy on questions like who will buy, or what the spot price of oil will be next month.
Data Lakes
One of the first applications of our new found compute power and flexibility was Data Lakes. These are the ad hoc repositories where you can place a lot of data without having to predefine a schema or getting IT involved. These are the data scientist’s playground where we can explore hypotheses and look for patterns without a lot of cost or time.
Data Lakes in Hadoop could be established in a matter of hours, mostly without waiting for IT to help. They really sped up the predictive modeling process, since the volume of data that could be processed was rapidly expanding thanks to MPP. They also gave us a place to begin developing our techniques for NLP and image processing.
Recommenders
Now that we could handle the volume and complexity of web logs and large transactional files, the field of recommenders took off.
Recommender insights are directional in nature but answer really important questions on the minds of non-data scientists like:
What should we buy?
What should we watch or read?
Who should we date or marry?
The evolution of Recommenders underlies all of search and ecommerce.
Natural Language Processing
As we move forward into about the last five years, the more important features of Big Data enabled by Hadoop and NoSQL have become its ability to support unstructured data and data in motion.
This is Alexa, Siri, Cortana, Google Assistant, and the thousands of chatbots that have started emerging just since 2015. NLP took several years to evolve and now requires deep learning algorithms like recurrent neural nets. Our deep learning algorithms wouldn’t be able to find these patterns without millions of data items to examine and MPP used to keep the training time within human time frames.
Chatbots, operating in both text and spoken language, have emerged so rapidly over just the last three years that in 2015 only 25% of surveyed companies had heard of them, while by 2017, 75% of companies were reported to be building them.
An interesting feature emerging from NLP is that we have learned to take unstructured text and convert it to features in our predictive models alongside our traditional variables to create more accurate models.
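A hedged sketch of that idea: convert free text into TF-IDF features and combine them with traditional numeric variables in a single model. The data, column meanings and choice of scikit-learn are all illustrative assumptions.

```python
# Sketch: blend unstructured text features with traditional numeric features.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

notes = ["late payment, unhappy customer", "renewed early, very satisfied"]
numeric = np.array([[42.0, 3], [18.0, 7]])   # e.g. tenure_months, purchases
labels = [1, 0]                              # e.g. churned / retained

text_features = TfidfVectorizer().fit_transform(notes)
X = hstack([text_features, csr_matrix(numeric)])  # text + traditional variables

model = LogisticRegression().fit(X, labels)
print(model.predict(X))
```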
Internet of Things (IoT)
IoT has created an industry of its own by taking the third capability of Hadoop and Big Data, the ability to process data in motion, and turning that relatively straightforward capability into an unbelievable variety of applications.
Hadoop allows us to look at semi-structured data streaming in from sensors and act on it before it has even been stored. This dramatically speeds up response time compared to the previous store-analyze-deploy paradigm.
IoT systems lead us back to the very accurate and specific end of the insight scale. Some of its actions can be driven by complex predictive models but others may simply compare a sensor reading to a standard value and issue a message. These can be as simple as “oh, oh, the dog has left the yard” or as sophisticated as “get a doctor to patient Jones who is about to have a heart attack in the next 5 minutes”.
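The "compare a sensor reading to a standard value and issue a message" case is simple enough to sketch directly; the sensor names and thresholds below are assumptions, not real device values.

```python
# Sketch: act on a sensor reading while it is still in motion.
from typing import Optional

THRESHOLDS = {"yard_gate": 1, "heart_rate": 120}   # illustrative standard values

def check_reading(sensor: str, value: float) -> Optional[str]:
    """Return an alert message if the reading exceeds its standard value."""
    limit = THRESHOLDS.get(sensor)
    if limit is not None and value > limit:
        return f"ALERT: {sensor} reading {value} exceeds {limit}"
    return None

print(check_reading("heart_rate", 135))   # triggers an alert
print(check_reading("heart_rate", 70))    # no alert -> None
```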
Image Processing, Reinforcement Learning, and Other Deep Learning Techniques
The most emergent of our new data science capabilities are those that have been loosely branded ‘artificial intelligence’. NLP, which has evolved from simple sentiment analysis and word clouds to full-fledged conversational ability, should also be included in this category. Taken together, they are the eyes, ears, arms and legs of our many robots, including self-driving cars.
Like NLP, image processing relies on deep neural nets, mostly in the class of convolutional neural nets. Reinforcement learning is still evolving a common tool set, but relies just as heavily on MPP over huge unstructured data sets.
Of course there have been other advancements, but they are more in the nature of refinements. Hadoop has largely been replaced by Spark, which carries forward all of its prior capabilities, only better and faster. CPUs used in MPP are being paired with or replaced by GPUs or FPGAs to create horizontal process scaling that allows commercial projects to take advantage of supercomputer speeds.
All of data science as we know it today, all of these innovations we’ve seen over the last 10 years, continues to grow out of the not-so-simple revolution in how we store and process data with NoSQL and Hadoop.
About the author: Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001. He can be reached at:
On November 16, we hosted the Modernize your Existing EDW with IBM Big SQL and Hortonworks Data Platform webinar with speakers from Hortonworks, Carter Shanklin and Roni Fontaine, and IBM, Nagapriya Tiruthani. The webinar provided an overview of how organizations can modernize their existing data warehouse solutions to easily offload data into Apache Hadoop and Apache Hive. It also provided best practices and use cases for offloading and porting workloads from Oracle, Db2 and Netezza as well as use cases for using Hive and/or Db2 Big SQL. To get access to the slides, go here.
Some great questions came across during the webinar. As promised, here is a brief capture of that Q&A.
1. Do Big SQL and fluid query have separate offerings?
No, Big SQL includes Fluid Query technology (aka Federation) to connect to remote data sources.
2. Does Big SQL support in-memory databases?
Yes, Big SQL federates to in-memory databases. SAP Hana was tested in-house using JDBC. Spark connector can also be used to access NoSQL or in-memory databases.
3. Does Big SQL have its own security implementation or does it simply use the security features in RDBMS and Hadoop?
Big SQL has its own security implementation: role-based access control (RBAC), which enables granular security settings on data for row filtering and column masking.
4. What is the best approach for ingesting data into Hadoop? Does Big SQL play a role once the data is in Hadoop for ELT processes?
5. From the architecture of Big SQL, I noticed that it uses Slider to leverage Yarn. But Slider is going to get deprecated and so how does Big SQL run as a yarn process?
Although the Slider project is deprecated, it is actually being merged into YARN. Therefore, Big SQL will be integrated with YARN to handle the resources for long-running processes.
6. Big SQL currently has a number of tables limitation of around 65,000. Is there a plan for Big SQL to remove that limitation?
We are exploring options to remove some of the Db2-imposed limits on Big SQL.
7. Can we use BIG SQL as an ETL tool to load data from Oracle to Hadoop?
Yes, you can use LOAD or INSERT…SELECT to offload data from Oracle to Hadoop.
8. Do I need Big SQL if I have Hive LLAP with SQOOP/Flume/Kafka/Spark Streaming integration?
If you want to query data that is just Hadoop, Hive LLAP might be adequate. If you want to combine data by federating to different sources or run complex queries with high concurrency, Big SQL will be a better fit.
9. Why do I need Big SQL when Hive can do everything I need?
Big SQL has its own unique set of capabilities. It can federate all your data behind a single SQL engine, it is compatible with Oracle and it provides performance optimization around highly complex workloads. Hive doesn’t handle Oracle or provide federation. Hive has its own unique capabilities around EDW Optimization use cases. If federation is important to you, it is worthwhile to look at this technology to use with Hive.
10. Do Hive and Big SQL run on the same cluster?
Big SQL has an Ambari management pack. It is fully managed with the Ambari stack. You can use the management pack to deploy Big SQL to run side by side on the same cluster as Hive.
11. When would I use Hive versus Druid?
Druid is a very interesting technology. It does not have a SQL interface, so we created a Hive-Druid interface that lets you do the analytics in SQL. How do you get SQL analytics for streaming data? You can use Druid as the place to land the streaming data and use Hive as the analytics layer on top. It’s essential to use both technologies together.
12. Does Druid integrate with Storm like Hive does?
Druid is typically integrated with Storm via Kafka, with Storm processing data and writing it to Kafka while Druid reads and indexes the data landed in Kafka for fast analytics. Hortonworks Data Flow (HDF) includes Streaming Analytics Manager which provides a drag-and-drop UI to make this end-to-end process simple.
13. Do I have to use the Druid API or the Hive API when historical/real-time data gets loaded?
For querying data, the Hive SQL API can be used to query data across both Hive and Druid, including joins across Hive and Druid data.
14. Can Big SQL and Hive share data nodes on the same cluster? And what will be the impact?
Yes, Big SQL and Hive both run within YARN in the Hadoop cluster and can run at the same time. This will lead to a performance impact as both Hive and Big SQL will compete for CPU, memory and I/O resources. More mission-critical applications often need greater separation which can be controlled using YARN capacity management features.
Glenn Gore here, Chief Architect for AWS. I’m in Las Vegas this week — with 43K others — for re:Invent 2017. We’ve got a lot of exciting announcements this week. I’m going to check in to the Architecture blog with my take on what’s interesting about some of the announcements from a cloud architectural perspective. My first post can be found here.
The Media and Entertainment industry has been a rapid adopter of AWS due to the scale, reliability, and low costs of our services. This has enabled customers to create new, online, digital experiences for their viewers ranging from broadcast to streaming to Over-the-Top (OTT) services that can be a combination of live, scheduled, or ad-hoc viewing, while supporting devices ranging from high-def TVs to mobile devices. Creating an end-to-end video service requires many different components often sourced from different vendors with different licensing models, which creates a complex architecture and a complex environment to support operationally.
AWS Media Services
Based on customer feedback, we have developed AWS Media Services to help simplify distribution of video content. AWS Media Services is comprised of five individual services that can either be used together to provide an end-to-end service or individually to work within existing deployments: AWS Elemental MediaConvert, AWS Elemental MediaLive, AWS Elemental MediaPackage, AWS Elemental MediaStore, and AWS Elemental MediaTailor. These services can help you with everything from storing content safely and durably to setting up a live-streaming event in minutes without having to be concerned about the underlying infrastructure and scalability of the stream itself.
In my role, I participate in many AWS and industry events and often work with the production and event teams that put these shows together. With all the logistical tasks they have to deal with, the biggest question is often: “Will the live stream work?” Compounding this fear is the reality that, as users, we are also quick to jump on social media and make noise when a live stream drops while we are following along remotely. Worse is when I see event organizers actively choosing not to live stream content because of the risk of failure and exposure, leading them to take the safe option and not stream at all.
With AWS Media Services addressing many of the issues around putting together a high-quality media service, live streaming, and providing access to a library of content through a variety of mechanisms, I can’t wait to see more event teams use live streaming without the concern and worry I’ve seen in the past. I am excited for what this also means for non-media companies, as video becomes an increasingly common way of sharing information and adding a more personalized touch to internally- and externally-facing content.
AWS Media Services will allow you to focus more on the content and not worry about the platform. Awesome!
Amazon Neptune
As a civilization, we have been developing new ways to record and store information, and to model the relationships between sets of information, for more than a thousand years. Government census data, tax records, births, deaths, and marriages were all recorded on media ranging from knotted cords in the Inca civilization and clay tablets in ancient Babylon to written texts in Western Europe during the late Middle Ages.
One of the first challenges of computing was figuring out how to store and work with vast amounts of information in a programmatic way, especially as the volume of information was increasing at a faster rate than ever before. We have seen different generations of how to organize this information in some form of database, ranging from flat files to the Information Management System (IMS) used in the 1960s for the Apollo space program, to the rise of the relational database management system (RDBMS) in the 1970s. These innovations drove a lot of subsequent innovations in information management and application development as we were able to move from thousands of records to millions and billions.
Today, as architects and developers, we have a vast variety of database technologies to select from, which have different characteristics that are optimized for different use cases:
Relational databases are well understood after decades of use in the majority of companies who required a database to store information. Amazon Relational Database Service (Amazon RDS) supports many popular relational database engines such as MySQL, Microsoft SQL Server, PostgreSQL, MariaDB, and Oracle. We have even brought the traditional RDBMS into the cloud world through Amazon Aurora, which provides MySQL and PostgreSQL support with the performance and reliability of commercial-grade databases at 1/10th the cost.
Non-relational databases (NoSQL) provided a simpler method of storing and retrieving information that was often faster and more scalable than traditional RDBMS technology. The concept of non-relational databases has existed since the 1960s but really took off in the early 2000s with the rise of web-based applications that required performance and scalability that relational databases struggled with at the time. AWS published this Dynamo whitepaper in 2007, with DynamoDB launching as a service in 2012. DynamoDB has quickly become one of the critical design elements for many of our customers who are building highly-scalable applications on AWS. We continue to innovate with DynamoDB, and this week launched global tables and on-demand backup at re:Invent 2017. DynamoDB excels in a variety of use cases, such as tracking of session information for popular websites, shopping cart information on e-commerce sites, and keeping track of gamers’ high scores in mobile gaming applications, for example.
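For example, here is a hedged boto3 sketch of the "gamers' high scores" use case; the table name and key schema are assumptions, and the table is presumed to already exist with AWS credentials configured.

```python
# Sketch: track high scores in DynamoDB (assumes a table "HighScores"
# with partition key "player_id" already exists).
import boto3

table = boto3.resource("dynamodb").Table("HighScores")

# Record a new score.
table.put_item(Item={"player_id": "alice", "game": "asteroids", "score": 9800})

# Read it back.
response = table.get_item(Key={"player_id": "alice"})
print(response.get("Item"))
```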
Graph databases focus on the relationship between data items in the store. With a graph database, we work with nodes, edges, and properties to represent data, relationships, and information. Graph databases are designed to make it easy and fast to traverse and retrieve complex hierarchical data models. Graph databases share some concepts from the NoSQL family of databases such as key-value pairs (properties) and the use of a non-SQL query language such as Gremlin. Graph databases are commonly used for social networking, recommendation engines, fraud detection, and knowledge graphs. We released Amazon Neptune to help simplify the provisioning and management of graph databases as we believe that graph databases are going to enable the next generation of smart applications.
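And a hedged Gremlin sketch of the kind of relationship traversal a graph database such as Neptune is built for, using the gremlinpython client; the endpoint, vertex labels, and property names are invented for illustration.

```python
# Sketch: traverse "who does Alice know?" with Gremlin (gremlinpython client).
# The Neptune endpoint below is a placeholder.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("wss://your-neptune-endpoint:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Nodes, edges, and properties: start at Alice, follow "knows" edges,
# and return the names of the people she knows.
names = g.V().has("person", "name", "Alice").out("knows").values("name").toList()
print(names)

conn.close()
```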
A common use case I am hearing every week as I talk to customers is how to incorporate chatbots within their organizations. Amazon Lex and Amazon Polly have made it easy for customers to experiment and build chatbots for a wide range of scenarios, but one of the missing pieces of the puzzle was how to model decision trees and knowledge graphs so the chatbot could guide the conversation in an intelligent manner.
Graph databases are ideal for this particular use case, and Amazon Neptune simplifies the deployment of a graph database while providing high performance, scalability, availability, and durability as a managed service. Security of your graph database is critical. To help ensure this, you can run Amazon Neptune within your Amazon Virtual Private Cloud (Amazon VPC) and encrypt your data at rest using encryption integrated with AWS Key Management Service (AWS KMS). Neptune also supports AWS Identity and Access Management (AWS IAM) to help further protect and restrict access.
Our customers now have the choice of many different database technologies to ensure that they can optimize each application and service for their specific needs. Just as DynamoDB has unlocked and enabled many new workloads that weren’t possible in relational databases, I can’t wait to see what new innovations and capabilities are enabled from graph databases as they become easier to use through Amazon Neptune.
Look for more on DynamoDB and Amazon S3 from me on Monday.
Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.
The CFP for Percona Live Santa Clara 2018 closes December 22, 2017: please consider submitting as soon as possible. We want to make an early announcement of talks, so we’ll definitely do a first pass even before the CFP date closes. Keep in mind the expanded view of what we are after: it’s more than just MySQL and MongoDB. And don’t forget that with one day less, there will be intense competition to fit all the content in.
A new book on MySQL Cluster is out: Pro MySQL NDB Cluster by Jesper Wisborg Krogh and Mikiya Okuno. At 690 pages, it is a weighty tome, and something I fully plan on reading, considering I haven’t played with NDBCLUSTER for quite some time.
Did you know that since MySQL 5.7.17, connection control plugins are included? They help DBAs introduce an increasing delay in server response to clients after a certain number of consecutive failed connection attempts. Read more at the connection control plugins.
While there are a tonne of announcements coming out of the Amazon re:Invent 2017 event, I highly recommend also reading Some data of interest as AWS reinvent 2017 ramps up by James Governor. Telemetry data from Sumo Logic’s 1,500 largest customers suggests that NoSQL database usage has overtaken relational database workloads! Read The State of Modern Applications in the Cloud. Page 8 tells us that MySQL is the #1 database on AWS (I don’t see MariaDB Server being mentioned, which is odd; did they lump it in together?), and MySQL, Redis & MongoDB account for 40% of database adoption on AWS. In other news, Andy Jassy also mentions that less than 1.5 months after hitting 40,000 database migrations, they’ve gone past 45,000 over the Thanksgiving holiday last week. Have you started using AWS Database Migration Service?
Upcoming appearances
ACMUG 2017 gathering – Beijing, China, December 9-10 2017 – it was very exciting being there in 2016, I can only imagine it’s going to be bigger and better in 2017, since it is now two days long!
Colin Charles is the Chief Evangelist at Percona. He was previously on the founding team for MariaDB Server in 2009, worked on MySQL since 2005, and has been a MySQL user since 2000. Before joining MySQL, he worked actively on the Fedora and OpenOffice.org projects. He’s well known within many open source communities and has spoken on the conference circuit.
Designed to power engaging mobile, IoT, and web applications, the enterprise-class Couchbase Data Platform includes Couchbase Server and Couchbase Mobile. Couchbase Server is a cloud-native, NoSQL database designed with a distributed architecture for performance, scalability, and availability. It enables developers to build applications by leveraging the power of SQL with the flexibility of JSON. Couchbase Mobile includes a fully integrated embedded database, built-in security, and real-time automated sync with the highly scalable Couchbase Server.
The Quick Start deploys Couchbase into a new or existing infrastructure in your AWS account. It uses Amazon Machine Images (AMIs) from AWS Marketplace, and provides two subscription options: Bring Your Own License (BYOL) or hourly pricing. The deployment is automated by AWS CloudFormation templates that you can customize during launch. You can also use the templates as a starting point for your own implementation, by downloading them from the GitHub repository. The Quick Start includes a guide with step-by-step deployment and configuration instructions.
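If you prefer to script the launch rather than use the console, a hedged boto3 sketch might look like the following; the template URL, stack name, and parameter names are placeholders, not the Quick Start's actual values.

```python
# Sketch: launch a Quick Start CloudFormation template programmatically.
# The TemplateURL and parameters below are placeholders, not real values.
import boto3

cloudformation = boto3.client("cloudformation", region_name="us-east-1")

cloudformation.create_stack(
    StackName="couchbase-quickstart-demo",
    TemplateURL="https://example-bucket.s3.amazonaws.com/couchbase-template.yaml",
    Parameters=[
        {"ParameterKey": "KeyName", "ParameterValue": "my-ec2-keypair"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # Quick Starts typically create IAM roles
)

# Wait for the deployment to finish before using the cluster.
cloudformation.get_waiter("stack_create_complete").wait(
    StackName="couchbase-quickstart-demo"
)
```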
To get started with Couchbase on AWS, use these resources:
About Quick Starts
Quick Starts are automated reference deployments for key workloads on the AWS Cloud. Each Quick Start launches, configures, and runs the AWS compute, network, storage, and other services required to deploy a specific workload on AWS, using AWS best practices for security and availability.
As stated in the last article, database obesity caused by numerous intermediate tables and stored procedures is rooted in the database's closed computational system. If there is an independent computing engine that provides computing capability outside the database, the database can lose weight.
With a separate computing engine, the database-generated intermediate data doesn't have to be stored as data tables; instead, it can be stored in the file system and computed further by the computing engine. Read-only intermediate data stored as files never needs to be rewritten, so it stays compact and compresses better, and its access doesn't require transaction consistency. Compared with a database, this simple storage and access mechanism delivers much better I/O performance.
A file system organizes data in a tree structure, so it can manage the intermediate data generated by different applications (or modules) by category. This is convenient, and it ties the intermediate data to its application (or module), preventing it from being accessed by other applications (or modules). When a module is changed or taken offline, the intermediate data it generates can be changed accordingly without worrying about coupling caused by data sharing. Similarly, a stored procedure that generates intermediate data can be moved out of the database to become part of the application, getting rid of the coupling problem.
Non-database-generated intermediate tables can also be reduced or eliminated. The extract and transform stages of an ETL operation can be handled outside the database by a computing engine, and only the clean data is then loaded into the database. The first two stages don't consume database computing resources, so intermediate tables are not needed to stage the data. The database is just responsible for storing the final result.
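A hedged sketch of that division of labor: extract and transform with a lightweight engine such as pandas, keep the intermediate result in the file system, and load only the clean data into the database. The file names, columns, and connection string are assumptions for illustration.

```python
# Sketch: extract and transform outside the database, keep intermediate
# data as files, load only the final clean result into the database.
import pandas as pd
from sqlalchemy import create_engine   # pip install sqlalchemy

# Extract: read raw export files instead of staging them in the database.
raw = pd.read_csv("exports/orders_raw.csv")

# Transform: cleaning happens in the computing engine, not in stored procedures.
clean = (
    raw.dropna(subset=["order_id"])
       .assign(amount=lambda df: df["amount"].round(2))
)

# Intermediate data stays in the file system, grouped by application/module.
clean.to_parquet("intermediate/orders_clean.parquet")

# Load: only the clean result consumes database storage and resources.
engine = create_engine("mysql+pymysql://etl:secret@localhost/warehouse")
clean.to_sql("orders", engine, if_exists="append", index=False)
```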
A computing engine can handle mixed computations for data presentation that involve both non-database data sources and database data, making it unnecessary to load external data into the database and thus reducing intermediate tables considerably. Because the computing engine sends an ad hoc retrieval request to the data source to get the most recent data for presentation, it offers better real-time capability; loading data periodically and storing it as intermediate tables, by contrast, can miss the most recent data. Leaving external data where it is also helps exploit the strengths of the non-database data sources: NoSQL databases are good at looking up data by key and handle data of various structures well, and a professional data computing engine that handles multi-level data such as XML and JSON well beats conventional relational databases at expressing computing logic.
Apart from the essential computing power, a computing engine intended to relieve database burden must possess good openness and be integration-friendly.
Openness here refers to computing capability that is independent of any storage system. A system with open computing ability can compute data coming from any data source, such as the file system, and lets you choose a suitable storage plan to organize and manage the intermediate data. A computing system that requires a specific data storage mechanism (say, a database) is the same old stuff with a different label. Integrability means the computing procedure is embedded into the application as a part of it, rather than running as a separate process shared by multiple applications (modules), so application coupling won't happen.
Measured against these two features, the Hadoop system (including Spark) isn't suitable to work as an open computing engine, even though it has some computing power. It possesses a certain degree of openness and can compute external data, but the performance is poor and the ability is seldom used. A Hadoop system is huge and runs as an independent process; it has almost no integrability and can't be fully embedded into an application.
A truly open and integration-friendly computing engine enables the separation of computing ability from storage strategy, making it convenient and flexible to design an application's structure. With such a computing engine, there's no need to deploy an additional database or scale out the database just to access computing power. It lets the database do the job it is best at, making the best use of resources.
More than 2.5 quintillion bytes of data—as much as 250,000 times the printed material in the U.S. Library of Congress—come into existence every day. What this data means for the average enterprise is opportunity: the opportunity to improve fraud protection, compliance and personalization of services and products.
But first, you need to make sure you are working with the right data and that your data is consistent and clean.
While data governance itself is not a new concept, the need for significantly better data governance has grown with the volume, variety and velocity of data. With this need for better data governance has come a need for better databases. Before we get into that, let’s make sure we’re clear on what data governance is and how it’s used.
The Three Pillars of Data Governance
Data governance is the establishment of processes around data availability, usability, consistency, integrity and security, all of which fall into the three pillars of data governance.
Pillar 1: Data Stewardship
In an age when data silos run rampant and “bad data” is blamed for nearly every major strategic oversight at an enterprise, it’s critical to have someone or something at the ready to ensure business users have high-quality, consistent and easily accessible data.
Enter data stewardship and the “data steward.” A data steward ensures common, meaningful data across applications and systems. This is much easier said than done, of course, and quite often the problems with data stewardship arise from a lack of clarity or specificity around the data steward’s function, as there are many ways to approach it (i.e., according to subject area, business function, business process, etc.).
Nevertheless, properly stewarding data has become a key ability for today’s enterprises and is a key aspect of proper data governance at any organization.
Pillar 2: Data Quality
Where data governance itself is the policies and procedures around the overall management of usability, availability, integrity and security of data, data quality is the degree to which information consistently meets the expectations and requirements of the people using it to perform their jobs.
The two are, of course, very intertwined, although data quality should be seen as a natural result of good data governance, and one of the most important results that good data governance achieves.
How accurate is the data? How complete? How consistent? How compliant? These are all questions of data quality, and they are often addressed via the third pillar of data governance: master data management.
Pillar 3: Master Data Management
Master data management, or MDM, is often seen as a first step towards making the data usable and shareable across an organization. Enterprises are increasingly seeking to consolidate environments, applications and data in order to:
Increase efficiency
Reduce risks and costs
Improve compliance
Increase security and auditability of data
Improve customer satisfaction and retention
MDM is a powerful method used to achieve all of the above via the creation of a single point of reference for all data.
NoSQL Graph Database for Data Governance
Considering the recent Facebook fiasco with personal data, and with big regulations like the General Data Protection Regulation (GDPR) now in effect, it’s impossible to overstate the importance of data governance.
NoSQL databases were designed with modern IT architectures in mind. They use a more flexible approach that enables increased agility for development teams, which can evolve the data models on the fly to account for shifting application requirements. NoSQL databases are also easily scalable and can handle large volumes of structured, semi-structured and unstructured data.
Graph databases can be implemented as native graphs, while non-native graph databases, which are slower, store data in relational databases or other NoSQL databases (such as Cassandra) and use graph processing engines for data access. Graph databases are well-suited for applications traversing paths between entities or where the relationship between entities and their properties needs to be queried.
This relationship-analysis capability makes them ideal for empowering solid data governance at organizations of all types and sizes. From fraud protection to compliance to getting a complete view of the customer, a NoSQL graph database makes data governance much easier and much less costly.
To learn more about how to use a NoSQL graph database for data governance, click here.
Josh Ledbetter Senior Account Executive, OrientDB, an SAP Company
Risk management is a critical task throughout the world of finance (and increasingly in other disciplines as well). It is a significant area of investment for IT teams across banks, investors, insurers, and other financial institutions. MemSQL has proven to be very well suited to support risk management and decisioning applications and analytics, as well as related areas such as fraud detection and wealth management.
In this case study we’ll show how one major financial services provider improved the performance and ease of development of their risk management decisioning by replacing Oracle with MemSQL and Kafka. We’ll also include some lessons learned from other, similar MemSQL implementations.
Starting with an Oracle-based Data Warehouse
At many of the financial services institutions we work with, Oracle is used as a database for transaction processing and, separately, as a data warehouse. In this architecture, an extract, transform, and load (ETL) process moves data between the operational database and the analytics data warehouse. Other ETL processes are also typically used to load additional data sources into the data warehouse.
The original architecture was slowed by ETL processes that ran at irregular intervals and required disparate operations skills
This architecture, while functional and scalable, is not ideal to meet the growing concurrency and performance expectations that risk management systems at financial institutions need to meet. MemSQL customers have seen a number of problems with these existing approaches:
Stale data. Fresh transaction data that analytics users want is always a batch load (into the transaction database), a transaction processing cycle, and an ETL process away from showing up in the OLAP database.
Variably aged data. Because there are different data sources with different processing schedules, comprehensive reporting and wide-ranging queries might have to wait until the slowest process has had a chance to come up to date.
Operational complexity. Each ETL process is its own hassle, taking up operators’ time, and confusingly different from the others.
Fragility. With multiple processes to juggle, a problem in one area causes problems for all the analytics users.
Expense. The company has too many expensive contracts for databases and related technology and needs too many people with varied, specialized skills in operations.
What’s Needed in a Database Used for Risk Management
The requirements for a database used to support risk management are an intensification of the requirements for other data-related projects. A database used for risk management must power a data architecture that is:
Fast. Under intense regulatory pressure, financial services companies are responsible for using all the data they have in their possession, now. Slow answers to questions are not acceptable.
Up-to-date. The common cycle of running data through an OLTP database, an ETL process, and into an OLAP database / data warehouse results in stale data for analytics. This is increasingly unacceptable for risk management.
Streaming-ready. There is increasing pressure on financial services institutions to stream incoming data into and through a database for immediate analytics availability. Today, Kafka provides the fast connections; databases must do their part to process data and move it along smartly (see the ingestion sketch after this list).
High concurrency. Top management wants analytics visibility across the entire company, while more and more people throughout the company see analytics as necessary for their daily work. This means that the database powering analytics must support large numbers of simultaneous users, with good responsiveness for all.
Flexible. A risk management database may need to be hosted near to where its incoming data is, near to where its users are, or a combination. So it should be able to run in any public cloud, on premises, in a container or virtual machine, or in a blended environment to mix and match strengths, as needed to meet these requirements.
Scalable. Ingest and processing requirements can grow rapidly in any part of the data transmission chain. A database must be scalable so as to provide arbitrarily large capacity wherever needed.
SQL-enabled. Scores of popular business intelligence tools use SQL, and many users know how to compose ad hoc queries in SQL. Also, SQL operations have been optimized over a period of decades, meaning a SQL-capable database is more likely to meet performance requirements.
Two important capabilities of a risk management system highlight why these characteristics matter in the database that drives it.
The first area is the need for pre-trade analysis. Traders want active feedback to their queries about the risk profile of a trade. They – and the organization – also need background analysis and alerting for trades that are unusually risky, or beyond a pre-set risk threshold.
Pre-trade analysis is computationally intense, but must not slow other work. (See “fast” and “high concurrency” above.) This analysis can be run as a trade is executed, or can be run as a precondition to executing the trade – and the trade can be flagged, or even held up, if the analysis is outside the organization’s guidelines.
What-if analysis – or its logical complement, exposure analysis – is a second area that is highly important for risk management. An exposure analysis answers questions such as, “What is our direct exposure to the Japanese yen?” That is, what part of our assets are denominated in yen?
It’s equally important to ask questions about indirect exposure – all the assets that are affected if the yen’s value moves strongly up or down. With this kind of analysis, an organization can avoid serious problems that might arise if its portfolios, as a group, drift too strongly into a given country, currency, commodity, and so on.
A what-if analysis addresses these same questions, but makes them more specific. “What if the yen goes up by 5% and the Chinese renminbi drops by 2%?” This is the kind of related set of currency movements that might occur if one country’s economy heats up and the other’s slows down.
These questions are computationally intense, require wide swaths of all the available data to answer – and must be able to run without slowing down other work, such as executing trades or powering real-time analytics dashboards. MemSQL characteristics such as speed, scalability, and support for a high degree of concurrency allow these risk management-specific needs to be addressed smoothly.
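As a purely illustrative sketch (the positions table and its columns here are hypothetical, not taken from the customer’s schema), a direct-exposure query could look something like this:

-- Hypothetical schema: positions(position_id, desk, currency, notional_usd)
-- Direct exposure to the Japanese yen, broken out by trading desk.
SELECT desk, SUM(notional_usd) AS jpy_exposure_usd
FROM positions
WHERE currency = 'JPY'
GROUP BY desk
ORDER BY jpy_exposure_usd DESC;

Indirect and what-if variants layer additional joins (for example, to instrument or counterparty reference data) and scenario parameters on top of the same pattern.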
Improving Performance, Scale, and Ease of Development with MemSQL
Oracle, and other legacy relational databases, are relatively slow. They can only serve as an OLTP or OLAP database (not both in one); they do not support high concurrency or scale without significant added cost and complexity; and they require specialized hardware for acceleration. These legacy relational databases are also very expensive to license and operate compared to modern databases.
Oracle has worked to address many of these problems as their customers’ needs have changed. The scalability limits of its single-node architecture can be partly addressed by scaling up, albeit on Exadata, a massively expensive and hard-to-manage system. Oracle also meets the SQL requirement, which gives it an advantage over NoSQL systems – but not over modern “NewSQL” databases like MemSQL.
After due consideration, the customer chose to move their analytics support from an Oracle data warehouse to an operational database running MemSQL.
The Solution: A Database for Operational Analytics
To address the challenges with an Oracle-centric legacy architecture, one company we work with decided to move to operational analytics. An operational approach to analytics puts all the data that’s needed by the company on an ongoing basis into a single data store and makes it available for rapid, ongoing decision-making.
This approach also seeks to reduce the lag time from the original creation of a data item to its reflection in the operational data store. As part of this effort, all messaging between data sources and data stores is moved to a single messaging system, such as Apache Kafka. ETL processes are eliminated where possible, and standardized as loads into the messaging system where not.
The operational data store does a lot – but not everything. It very much supports ad hoc analytics queries, reporting, business intelligence tools, and operational uses of machine learning and AI.
What it doesn’t do is store all of the data for all time. There are cost, logistical, and speed advantages to not keeping all potentially relevant company data in the operational data store.
Non-operational data is either deleted or – an increasingly common alternative – batch loaded into a data lake, often powered by Hadoop/HDFS, where it can be stored long-term, and also plumbed as needed by data scientists.
The new architecture has fast and robust support for analytics and data science users
The data lake also serves a valuable governance function by allowing the organization to keep large amounts of raw or lightly processed data, enabling audits and far-reaching analytical efforts to access the widest possible range of data, without interfering with operational requirements.
MemSQL is well suited for operational analytics. MemSQL features fast ingest via its Pipeline features. It can also handle transactions on data coming in via the Pipeline – either directly, for lighter processing, or through the use of Pipelines to stored procedures for more complex work. Stored procedures add capability to the ingest and transformation process.
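As a rough illustration of the mechanism (the pipeline, topic, and table names and the broker address below are placeholders, and the exact options are covered in MemSQL’s Pipelines documentation), a pipeline that streams a Kafka topic straight into a table looks roughly like this:

-- Sketch only: placeholder broker, topic, and table names.
CREATE PIPELINE trade_events_pipeline
AS LOAD DATA KAFKA 'kafka-broker:9092/trade-events'
INTO TABLE trade_events;

-- Begin consuming from the topic.
START PIPELINE trade_events_pipeline;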
MemSQL can support data ingest, transactions, and queries against the operational data store, all running at the same time. Because it’s a distributed system, MemSQL can scale out to handle as much data ingest, transformational processing, and query traffic as needed.
A separate instance of MemSQL can also be used for the data lake, but that function is more often handled by Hadoop/HDFS or another system explicitly designed as a data lake.
Implementing MemSQL as an Operational Data Warehouse
The financial services company described above wanted to significantly improve their portfolio risk management capabilities, as well as other analytics capabilities. They also wanted to support both real-time operational use and research use of machine learning and AI.
In support of these goals, the company implemented an increasingly common architecture based on three modern data tools:
Messaging with Apache Kafka. The company standardized on Kafka for messaging, speeding data flows and simplifying operations.
Analytics database consolidation to MemSQL. A single data store running on MemSQL was chosen as the engine and source of truth for operational analytics.
Standalone data lake with Apache Hadoop. The data lake was taken out of the operational analytics flow and used to store a superset of the operational data.
As you can see, the core of the architecture became much simpler after the move to MemSQL as an operational data warehouse. The architecture is made up of four tiers.
Inputs
Each operational system, every external data source, and each internal source of behavioral data outputs to the same destination – a data streaming cluster running a Kafka-based streaming platform from Confluent.
Streaming Data Ingestion
The data streaming cluster receives all inputs and routes the data to three destinations:
Operational data warehouse. Most of the data goes to the operational data warehouse.
Data science sandbox. Some structured and semi-structured data goes to the data science sandbox.
Hadoop/HDFS. All of the data is sent to Hadoop/HDFS for long-term storage.
Data Stores
MemSQL stores the operational data warehouse and the data science sandbox. Hadoop/HDFS holds the data lake.
Queries
Queries come from several sources: ad hoc SQL queries; business apps; Tableau, the company’s main business intelligence tool; Microsoft Excel; SAS, the statistics tool; and data science tools.
Benefits of the Updated Data Platform
The customer who implemented risk management and other analytics, moving from ETL into Oracle to Kafka, MemSQL, and Hadoop, achieved a wide range of benefits.
They had begun with nightly batch loads for data, but needed to move to more frequent, intraday updates – without causing long waits or delays in analytics performance. For analytics, they needed sub-second response times for dozens of queries per second.
With MemSQL, the customer was able to load data in as soon as it became available. This led to better query performance, with query results that include the latest data. The customer has achieved greater performance, more uptime, and simpler application development. Risk managers have access to much more recent data.
Risk management users, analytics users overall, and data scientists share in a wide range of overall benefits, including:
Reduced Oracle licensing costs
Reduced costs due to less need for servers, compute cores, and RAM
Fresher data – new data available much faster
Less coding for new apps
Lower TCO
Cloud connectivity and flexibility
Reduction in operations costs
Elimination of maintenance costs for outmoded batch apps
More analytics users supported
Faster analytics results
Faster data science results
New business opportunities
Why MemSQL for Risk Management?
MemSQL is fast – with the ability to scan up to one trillion rows per second. It’s a distributed SQL database, fully scalable. MemSQL supports streaming, in combination with messaging platforms such as Apache Kafka, and supports exactly-once guarantees. MemSQL supports high levels of concurrency and runs everywhere – on premises or in the cloud, in containers or virtual machines.
MemSQL customers often begin by moving some or all of their analytics to MemSQL for better responsiveness, greater concurrency, and reduced costs for the platform – including software licensing, hardware requirements, and operations expenses.
Customers then tend to find that MemSQL can take over more and more of the data pipeline. The combination of Kafka for messaging, MemSQL for data processing, Hadoop/HDFS as a data lake, and BYOBI (bring your own business intelligence, or BI, tools), can serve as a core architecture for a wide range of data analytics needs.
You can try MemSQL today for free. Or, contact us to speak with a technical professional who can describe how MemSQL can help you achieve your goals.
A customer recently asked for a TTL feature in MySQL. The idea is to automatically delete rows from a certain table after a defined lifespan, e.g. 60 seconds. This feature is common in many NoSQL databases, but it is not available in MySQL. However, MySQL offers everything you need to implement it yourself, and thanks to partitioning it can be done far more efficiently than by simply deleting rows. Let’s test it.
tl;dr
Partition the table and truncate partitions in a regularly scheduled event; that does the trick and comes at a fraction of the cost of regularly deleting rows.
The test case
The table needs a column to keep track of row age. This can be either a “created_at” column or an “expires_at” column. (“expires_at” has the additional advantage that each row can have an individual lifespan. Not possible in many NoSQL solutions.)
So my table is
CREATE TABLE `t` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `content` varchar(42) DEFAULT NULL
);
I tested two variants to implement a 10 seconds TTL on my table “t”:
The simple solution
Run an event every 10 seconds to delete rows that have been created more than 10s ago.
DELIMITER |
CREATE EVENT ttl_delete
ON SCHEDULE EVERY 10 SECOND STARTS '2019-03-04 16:00:00' DISABLE
DO BEGIN
DELETE FROM t WHERE created_at < NOW() - INTERVAL 10 SECOND;
END |
DELIMITER ;
An index on “created_at” might improve performance for the DELETE job. But in any case it is quite expensive to scan the table and remove roughly 50% of its rows, at least if the INSERT rate is high.
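For instance, such an index could be added as a plain secondary index:

ALTER TABLE t ADD INDEX idx_created_at (created_at);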
The efficient solution
Instead of DELETEing we can use the much faster TRUNCATE operation. Obviously we do not want to TRUNCATE the whole table, but if we distribute the inserted rows into partitions it is safe to truncate any partition that contains only outdated rows. Let’s define three partitions (or buckets): one that is currently being written to, one that holds rows of the last 10 seconds, and one partition that can be truncated because its rows are older than 10 seconds. The key is to calculate the bucket from the current time. This can be done with the expression FLOOR(TO_SECONDS(NOW())/10) % 3, or more generically FLOOR(TO_SECONDS(NOW())/ttl) % number_of_buckets.
Now we can partition the table by this expression. For that we add a generated column to calculate the bucket from the column “created_at” and partition the table by column “bucket”. The table now looks like this:

CREATE TABLE `t` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `content` varchar(42) DEFAULT NULL,
  `bucket` tinyint(4) GENERATED ALWAYS AS (floor(TO_SECONDS(`created_at`) / 10) % 3) STORED NOT NULL,
  PRIMARY KEY (`id`,`bucket`)
)
PARTITION BY LIST (`bucket`) (
  PARTITION p0 VALUES IN (0),
  PARTITION p1 VALUES IN (1),
  PARTITION p2 VALUES IN (2)
);
And the event procedure is like this:

DELIMITER |
CREATE EVENT ttl_truncate
ON SCHEDULE EVERY 10 SECOND STARTS '2019-03-04 16:00:00' DISABLE
DO BEGIN
  CASE FLOOR(TO_SECONDS(NOW())/10)%3
    WHEN 0 THEN ALTER TABLE test.t TRUNCATE PARTITION p1;
    WHEN 1 THEN ALTER TABLE test.t TRUNCATE PARTITION p2;
    WHEN 2 THEN ALTER TABLE test.t TRUNCATE PARTITION p0;
  END CASE;
END |
DELIMITER ;
Watching the rows come and go
To verify that the procedure works as expected I created a small monitor procedure that displays the number of rows per partition every second. That makes it easy to follow in which partition data is currently added and when a partition gets truncated.

DELIMITER |
CREATE PROCEDURE monitor()
BEGIN
  WHILE 1=1 DO
    SELECT "p0" AS "part", count(*) FROM t PARTITION (p0)
    UNION SELECT "p1", count(*) FROM t PARTITION (p1)
    UNION SELECT "p2", count(*) FROM t PARTITION (p2);
    SELECT now() AS "NOW", floor(to_seconds(now())/10)%3 AS "Bucket";
    SELECT sleep(1);
  END WHILE;
END |
DELIMITER ;
This procedure is not ideal: so many count(*) calls will create quite a bit of locking. But it is accurate. The alternative is to read from INFORMATION_SCHEMA.PARTITIONS, but that does not give the exact row count, which I needed for verification.
Increasing Accuracy
If the TTL is 10 seconds, deleting or truncating every 10 seconds means you have at least 10 seconds of rows available. In reality you will have 10 to 20 seconds worth of data, so on average 15 seconds (assuming a constant INSERT rate). If you run the cleaner job more often (say once per second), you hold on average only 10.5 seconds worth of data, at the cost of running the cleaning event more often. Increasing the accuracy this way can be very beneficial, because all other queries profit from having less data to operate on and less memory consumed by expired rows.
If you go with the simple solution of a regular DELETE event, it is sufficient to schedule the event more often.
If you prefer the TRUNCATE PARTITION option, it is necessary to increase the number of partitions or buckets to 12 (= 2 + TTL / how often to run the cleaning job).
The expression for the calculated bucket column then becomes `bucket` tinyint(4) GENERATED ALWAYS AS (floor(TO_SECONDS(`created_at`) / cleaning_interval) % #buckets) STORED NOT NULL (note that the divisor is the cleaning interval, not the TTL, so that each bucket holds one cleaning interval worth of rows)
and the partitioning needs to be adapted as well.
And the CASE construct in the cleaner event must be extended with a branch for each additional bucket/partition: WHEN n THEN ALTER TABLE test.t TRUNCATE PARTITION p(n+1);
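As a sketch only (assuming a 10-second TTL, a cleaner that runs every second, partitions p0 through p11, and a bucket column computed as floor(TO_SECONDS(`created_at`) / 1) % 12), a dynamic-SQL event is an alternative to spelling out a twelve-branch CASE by hand:

DELIMITER |
CREATE EVENT ttl_truncate_fine
ON SCHEDULE EVERY 1 SECOND STARTS '2019-03-04 16:00:00' DISABLE
DO BEGIN
  -- The bucket that will be written next holds the oldest rows (11 to 12 seconds old),
  -- so it is safe to truncate it just before it becomes the write bucket again.
  SET @p = (FLOOR(TO_SECONDS(NOW()) / 1) + 1) % 12;
  SET @stmt = CONCAT('ALTER TABLE test.t TRUNCATE PARTITION p', @p);
  PREPARE trunc FROM @stmt;
  EXECUTE trunc;
  DEALLOCATE PREPARE trunc;
END |
DELIMITER ;

Both approaches truncate the same partition; the dynamic variant just avoids maintaining one CASE branch per bucket.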
What happens if…
… the event stops?
Then you keep all your rows which will likely create some follow-up problems. As always: Proper monitoring is key. Think about MySQL Enterprise Monitor for example.
… the event procedure runs at inaccurate timing due to overall system load?
Not a big problem. The event will never run too early, so it will never remove rows too early. If it runs late, it cleans rows late, so you have more garbage in the table, which might affect other queries; the effective TTL simply increases while this happens.
Performance Considerations
By no means am I able to run proper performance tests: I am running on a Windows 10 laptop, with Oracle Linux in VirtualBox and MySQL inside a Docker container, so there are plenty of reasons for bad numbers. But it should be sufficient to compare the two implementations.
I have extended the cleaner events to report the time needed to execute the event procedure. Here is the example for the simple cleaner job:

DELIMITER |
CREATE EVENT ttl_delete
ON SCHEDULE EVERY 10 SECOND STARTS '2019-03-04 16:00:00' DISABLE
DO BEGIN
  DECLARE t1,t2 TIME(6);
  SET t1=current_time(6);
  DELETE FROM t WHERE created_at < NOW() - INTERVAL 10 SECOND;
  SET t2=current_time(6);
  INSERT INTO ttl_report VALUES ("DELETE simple", now(), timediff(t2,t1));
END |
DELIMITER ;
The load was generated by mysqlslap, which only inserted rows into the table. Each test run starts the respective cleaner event, runs the mysqlslap load and then stops the cleaner event:

mysql -h 127.0.0.1 -uroot -pXXX -e "USE test; ALTER EVENT ttl_delete ENABLE;"
mysqlslap -h 127.0.0.1 -uroot -pXXX --create-schema=test --concurrency=5 --iterations=20 --number-of-queries=10000 --query="INSERT INTO test.t (created_at, content) VALUES (NULL,md5(id));"
mysql -h 127.0.0.1 -uroot -pXXX -e "USE test; ALTER EVENT ttl_delete DISABLE;"

mysql -h 127.0.0.1 -uroot -pXXX -e "USE test; ALTER EVENT ttl_truncate ENABLE;"
mysqlslap -h 127.0.0.1 -uroot -pXXX --create-schema=test --concurrency=5 --iterations=20 --number-of-queries=10000 --query="INSERT INTO test.t (created_at, content) VALUES (NULL,md5(id));"
mysql -h 127.0.0.1 -uroot -pXXX -e "USE test; ALTER EVENT ttl_truncate DISABLE;"
The results are clearly in favor of truncating partitions, and the difference should grow as the INSERT rate increases. My poor setup achieved fewer than 1,000 inserts per second…
select who, avg(how_long) from ttl_report GROUP BY who;
+---------------+--------------------+
| who           | avg(how_long)      |
+---------------+--------------------+
| DELETE simple | 1.1980474444444444 |
| truncate      | 0.0400163333333333 |
+---------------+--------------------+
3 rows in set (0.0014 sec)
Side note
You might wonder why my test load is INSERT INTO test.t (created_at, content) VALUES (NULL,'foo');
Why do I mention the column “created_at” but then store NULL to give it the default of current_timestamp? If I omit the created_at column in this INSERT statement I get an error from the generated column due to bug #94550. Setting explicit_defaults_for_timestamp to OFF and then mentioning the timestamp column during INSERT is a workaround.
I have had some requests to write some blogs on the basics of using PHP and MySQL together. This will not be a series for the experienced as it will start at a level where I will go into a lot of details but expect very few prerequisites from the reader. If this is not you, please move on. If it is you and you read something you do not understand, please contact me to show me where I assumed too much.
PHP and MySQL are both in their mid-twenties and both vital in the developer world. With the big improvements in PHP 7 and MySQL 8, I have found a lot of developers flocking to both but stymied by the examples they see, as there are many details left unexplained. So let’s get to the explaining!
1. Use the latest software
If you are not using PHP 7.2 or 7.3 (or maybe 7.1) then you are missing out on features and performance. The PHP 5.x series is deprecated, no longer supported, and quickly disappearing.
MySQL 8.0 is likewise a big advancement but many sites are using earlier versions. If you are not on 5.6 with plans to upgrade to 5.7 then you are about to be left behind. If you are running an earlier version then you are using antique code. Also get your MySQL from MySQL, as your Linux distro may not be current, especially the smaller ones. The APT and DEB repos can be found here, and there are Docker containers available too.
In many cases it is a fight to keep your core software tools up to date, or even fairly up to date. The time and heartache spent fighting problems already resolved in a later version, or creating workarounds for a feature missing from your older version, will eventually bite you hard and should be viewed as a CLM (Career Limiting Move). By the way, hiring managers look for folks with current skills, not skills in a decade-old version of a technology.
2. Do not pick one connector over another (yet!)
PHP is a very rich environment for developers and it has three viable options for connecting to MySQL databases. Please note that the older mysql connector is deprecated, no longer supported, and to be avoided. It was replaced by mysqli (with the mysqlnd native driver), which is officially supported by Oracle’s MySQL engineers. Next is the PDO (PHP Data Objects) connector, which is designed to be database agnostic; there is no overall maintainer who watches over the code, but Oracle MySQL engineers do try to fix MySQL-related issues as long as they do not impinge on other PDO code. And the newest, built on the new MySQL X DevAPI and X Protocol, is the X DevAPI connector, which supports both SQL and NoSQL interfaces.
The good news for developers is that you can install all three, side by side, with no problem. For those starting out, switching from one connector to another can be a little confusing, as they each work just a bit differently, but the ability to use two or more is a good skill to have. Much like being able to drive a car with either an automatic or a manual transmission, it gives you more professional range.
Next time we will install PHP, MySQL, the three connectors, and some other cool stuff so you can start using PHP to access your MySQL servers.
Many companies are now using microservices for both re-architecting existing applications and for starting brand-new projects and initiatives, and they’re investing heavily in doing so. In one survey by Red Hat, 87% of respondents indicated they are using or considering multiple technologies for developing microservices.
It would be prudent for these companies to first take a good look at the benefits and challenges associated with microservices architectures so that they know what they’re in for.
But first—let’s define the term “microservices.”
What are Microservices?
Microservices are an approach to software architecture that breaks large applications into smaller pieces. As with many architectural decisions, the decision to use microservices is not cut and dried and will most likely involve some trade-offs.
Here are some of the pros and cons of a microservices architecture:
After weighing out these pros and cons in the context of a specific system’s requirements, a decision about whether to use microservices can be made. It’s not the right solution for every system. But, if you’ve been running into a wall trying to accomplish some of the things that microservices provide, it can be a godsend.
Microservices Challenges and Benefits – An Example
Let’s look at a specific example showing the challenges and benefits of using microservices.
Imagine you have an ecommerce application with four main features: 1) a product catalog, 2) a shopping cart, 3) an affiliate program, and 4) customer profile management, and you have a single “monolithic” service that supports all of these features.
Your system might look like the following diagram. You can scale out your system by adding multiple instances of your application, and you can take all of your incoming traffic and balance those requests across all of your services.
However, you also have some issues:
There are few people at your company who understand the service top to bottom. New hires in engineering take several months to become effective.
As your company has grown, your application has become more and more complex with many interdependencies. It is difficult to deploy new versions of your service because it requires extensive testing to make sure that all the pieces of the application work together seamlessly.
You are deploying more and more copies of your applications to handle spiking traffic from your affiliate clickstream, even though the other parts of your traffic are relatively predictable and stable.
Now, suppose you break this single application into four microservices, one for each feature (product catalog, shopping cart, affiliate program, and customer profile management). By doing so, the following changes are enabled:
Your engineering staff is broken into four separate teams; each team is responsible for knowing their microservice.
The product catalog team decides they want to use search technology to power their product catalog, rather than a relational database system; likewise, the team running the affiliate system decides to use a NoSQL database to store their time-series data; and the shopping cart team decides to host their data in a SaaS-based ecommerce system.
The team that handles system deployments adopts a new scaling strategy and is able to scale instances of the affiliate microservice separately from other microservices.
Our new, microservices-based system might look like the following diagram:
The benefits we’re reaping from our microservices architecture are wonderful.
However, there are some new challenges introduced. Because our data is now distributed between systems (and even different databases), some operations are not as simple as they were before.
For instance, to make sure that we have data integrity between our systems, we can no longer rely on a relational database transaction (i.e., “BEGIN TRAN/COMMIT TRAN”) to make sure that data is consistent between tables; instead, we have to look at other mechanisms that are enforced in our application tier, not in the data tier.
Or, something as simple as joining two tables together to get an aggregate view of data cannot be done anymore (how would you join a relational database table to a search engine?). Again, we have to accomplish the aggregation of data via application-layer mechanisms.
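As a purely illustrative sketch (the service endpoints and field names below are hypothetical, and it assumes a runtime with a global fetch such as Node 18+), an application-tier join might look like this:

// Hypothetical endpoints: in a real system these would be the catalog and
// affiliate microservices' own APIs.
async function getProductsWithClickStats(productIds) {
  // Fetch product details from the catalog service (backed by a search engine).
  const products = await fetch(`https://catalog.internal/products?ids=${productIds.join(',')}`)
    .then((res) => res.json());

  // Fetch click counts from the affiliate service (backed by a NoSQL store).
  const clicks = await fetch(`https://affiliate.internal/clicks?ids=${productIds.join(',')}`)
    .then((res) => res.json());

  // "Join" the two result sets in application code, keyed by product id.
  const clicksById = new Map(clicks.map((c) => [c.productId, c.count]));
  return products.map((p) => ({ ...p, clickCount: clicksById.get(p.id) ?? 0 }));
}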
Microservices Trade-offs – The Bottom Line
In order to minimize the impact of these types of challenges, it is really important to understand your access patterns and use cases up front and to define the boundaries between your microservices thoughtfully. This is the most important piece of designing an effective microservices architecture.
I recently attended a lecture given by Martin Fowler, a tech thought leader who was one of the first folks to speak about microservices. Someone in the audience asked him, “What’s one of the biggest problems with microservices?” Martin replied, “The fact that people are doing them.” After some nervous laughter from the audience, Martin further explained that microservices only make sense at a certain scale and complexity; he said that small companies with small systems do microservices in an effort to solve problems that don’t exist yet.
Microservices architectures—like most technology architecture decisions—involve trade-offs. It’s important to understand what problems they solve and what new problems they introduce, and then, with a solid understanding of your requirements, apply them where they make sense.
The past year has seen in-memory data grids (IMDG) continue to gain traction with the development community and large organisations alike. As you’ll see in the 2019 IMDG LinkedIn Survey results below, adoption of IMDG as a skill in LinkedIn profiles has risen by 43% YoY. Companies are turning to IMDGs as replacements for RDBMS and NoSQL solutions that struggle to perform at scale. The trend sees IMDGs used as a cornerstone for projects related to Digital Transformation strategies. A key driver for adoption is the ease with which IMDGs drop into new deployment platforms, such as Kubernetes and the cloud, an area where traditional data stores struggle. IMDGs are a viable alternative to a NoSQL solution for several technical reasons.
Easier to scale out
Cluster members co-ordinate amongst themselves for their share of partitioned data. There is no third party coordination as is the case with most NoSQL solutions. Each member of an IMDG cluster handles a portion of primary data partitions and a similar number of replica partitions. There is no concept of master or replica processes.
A better option for varied data retrieval
Some NoSQL solutions like Redis require knowledge of the key; there is no facility to query data based on a property. Instead, users have to maintain multiple data structures with the property as a key, known as reverse indexes. IMDGs offer key-based retrieval and SQL-like queries where only properties are known, in much the same way as a relational database operates.
Faster data retrieval
IMDGs have a facility called near caches, which allow frequently read data to be stored in a cache within the client to the cluster. This means that once read, the value will stay in the client process memory space until the value changes in the central cluster, at which time the cluster sends an invalidation message to the client. Popular NoSQL solutions do not offer this. General data retrieval from the cluster to a client is also faster, and everything is stored in-memory.
More efficient under mixed workloads
IMDGs are multi-threaded, whereas most NoSQL stores, like Redis, are single-threaded. A single thread hurts performance under mixed workloads; for example, while a Redis Lua script is running a compute job in the cluster, no other operations within that process can proceed. This can be particularly problematic if the script is long-running. IMDGs can maintain throughput with multiple compute jobs and data retrieval operations running at the same time. MongoDB does not offer a distributed compute facility at all.
Are much more than just a Data Store
IMDGs are not only an excellent choice as an elastic and resilient in-memory data store. They can also be used as a framework to build your own Distributed Systems. Most IMDGs now provide various atomic and lock APIs that aid in the building of services, pair this with the excellent event callbacks available for data mutations and IMDGs become a valuable tool for building microservice architectures.
Embeddable
The IMDGs reviewed in this blog are Java-based libraries. This means you can embed a cluster and the data structures directly within your applications. With NoSQL, you are forced to use a Client-Server architecture.
2019 IMDG LinkedIn Survey Results
The survey is a search for jobs and profiles based on the keyword of the IMDG product. It’s simple enough to verify the results found here independently. Some open source projects identify themselves differently than their commercial product namesakes, such as Pivotal Gemfire and Apache Geode. Where there are two names, I have searched for both and combined the results. This may lead to a double count in some instances, so for these products the results may be artificially increased. Other products, such as Hazelcast, carry the same name for their open source and commercial versions.
I’ve chosen what I consider to be the top 5 IMDGs; four of them are open source and one, Oracle Coherence, is proprietary closed source.
Hazelcast
Oracle Coherence
Gemfire / Apache Geode
GridGain / Apache Ignite
JBoss Data Grid / Infinispan
IMDG Products mentioned in LinkedIn Profiles.
The number of people mentioning an IMDG as a skill in their LinkedIn profile has grown by 43% YoY over 2018, a vital statistic and useful indicator of IMDG popularity with engineers and businesses alike. It is also an indicator of the available talent pool, an important consideration when weighing IMDG products against each other; this metric alone can have a strong influence on IMDG product selection within businesses.
Search Keywords               Feb 2019
Hazelcast                     7,623
Oracle Coherence              5,261
Gemfire/Apache Geode          4,031
GridGain/Apache Ignite        2,376
JBoss Data Grid/Infinispan    1,891
All of the IMDGs listed here increased their profile count, and Hazelcast maintained and expanded the lead it held in 2018. Second and third places swapped, with Oracle Coherence surprisingly pulling ahead of Pivotal Gemfire. Apache Ignite has pulled itself off the bottom of the table, with Infinispan dropping to take the wooden spoon.
Search conducted on 11th February 2019.
IMDG Product Job Listings
Job listings are another great indicator of IMDG popularity and, as mentioned, this metric is up from 2018 for all of the IMDG products we searched except for Oracle Coherence. There was no movement at the top, with Hazelcast staying in place as the IMDG with the most job opportunities; Hazelcast has more than twice as many job openings as the second-nearest, Pivotal Gemfire. 2018 saw a significant drop in requirements for Oracle Coherence positions.
Search Keywords               Feb 2019
Hazelcast                     617
Gemfire/Apache Geode          290
GridGain/Apache Ignite        235
JBoss Data Grid/Infinispan    91
Oracle Coherence              85
Search conducted on 11th February 2019
Conclusion
The trend for IMDGs is most definitely upward, as evidenced by the data above; that said, they’re still a relatively unknown resource. IMDGs can take their place alongside NoSQL and RDBMS in a multi-faceted solution. More and more architects realise that a NoSQL solution on its own will not solve their future data storage and processing requirements; a wider variety of more adaptable data solutions is required, and for this architects are turning to IMDGs. Many IMDG vendors have been developing complementary solutions on top of the core IMDG platform, for example Hazelcast Jet, an in-memory streaming platform that provides stream processing at ingestion rates far exceeding competing disk-bound solutions such as Apache Flink or Apache Kafka Streams. Once again, these solutions come with the added benefit of out-of-the-box cluster coordination; no extra processes, such as a ZooKeeper instance, are required.
The original version of this blog can be found at davebrimley.com.
We’re honored to announce that Forrester Research this week cited Redis Labs as a Leader in The Forrester Wave™: Big Data NoSQL, Q1 2019. In this evaluation, we believe that the leading independent research firm placed us as a leader based on Redis’ high-performance and multi-model approach, which powers a broad set of enterprise applications. In fact, we received the third-highest score out of 15 vendors in the current offering category, with Forrester noting that “Customer references like its innovation for machine learning apps, performance, scale, customer support, and support for diverse NoSQL use cases.”
For this research, Forrester identified, analyzed and scored key-value, document and graph database vendors across 26 criteria, evaluating each provider’s offering, strategy and market presence. The report highlights some of Redis and Redis Enterprise’s key capabilities, stating “Redis is a multi-model, open source, in-memory database platform whose key development is currently sponsored by Redis Labs. Redis supports both relaxed and strong consistency, a flexible schema-less model, high availability, and ease of deployment. An enterprise version encapsulates the open source software and provides additional capabilities for geo-distributed active-active deployments (multi-cloud, hybrid, on-premises) with high availability and linear scaling, while supporting the open source API.”
NoSQL has become critical for businesses to support a new generation of applications — which depend on real-time data to deliver an instant experience, requiring databases that can be deployed on multiple clouds, on-premises or in a hybrid architecture. This demand helped the category grow by leaps and bounds over the past decade, as NoSQL went from supporting simple schema-less apps to becoming a mission-critical data platform for large Fortune 1000 companies (including the 70% of Fortune 10 companies that use Redis Enterprise as their primary database). According to Forrester, half of global data and analytics technology decision makers either have implemented or are implementing flexible NoSQL platforms.
To learn more, read our press release or click to download the full research.
NoSQL is more than a decade old! Only a few years ago we were talking about how NoSQL was still maturing and its ecosystem still evolving. Today, NoSQL is used by more than half of large companies around the globe to support all kinds of workloads. Companies like its flexible, schemaless model, portability and lower cost. While most organizations still complement their relational databases with NoSQL, some have already started to replace them.
With more than two dozen NoSQL vendors, choosing the right NoSQL product is no longer simple. The good news is that we just published the new Forrester Wave on NoSQL! In our 26-criterion evaluation of NoSQL providers, we identified the 15 most significant ones — Aerospike, Amazon Web Services (AWS), ArangoDB, Couchbase, DataStax, Google, IBM, MarkLogic, Microsoft, MongoDB, Neo4j, Oracle, RavenDB, Redis Labs, and SAP — and researched, analyzed, and scored them. The report also includes trends and what criteria matter the most when choosing a NoSQL vendor. Besides new features evaluated in this Wave, two vendors to call out for their stronger momentum are Microsoft and Google. Both vendors have jumped into the Leaders category through their aggressive product offerings and stronger market penetration.
Forrester clients can access the report to see how each NoSQL provider measures up in order to make the best choice for their specific needs. If you are not a Forrester client, then you may be able to get a free reprint by visiting the website of one of the vendors mentioned above. If you would like to discuss the report or your NoSQL strategy in greater detail, please set up an inquiry.
In this tutorial, we’ll see how you can use MySQL in Node.js by creating a connection and executing SQL queries for performing CRUD operations.
We’ll be looking at the node-mysql2 module for connecting to a MySQL server from your Node.js applications.
Using MySQL in Node.js
You can use MySQL in Node.js through various modules such as node-mysql or node-mysql2.
Let’s see how we can use node-mysql2 (a fast, node-mysql-compatible MySQL driver for Node.js) to connect to a MySQL server, create a database, and perform CRUD operations against it.
In a nutshell, these are the required steps to use MySQL in Node:
Create a folder for your project and navigate inside it: mkdir node-mysql-example && cd node-mysql-example,
Add a package.json file using the npm init -y command,
Install the node-mysql2 module from npm using the npm install mysql2 --save command,
Create a server.js file and add the code below,
Run the application using node server.js.
Creating a Node.js Project
Let’s start by creating our Node.js project. First, create a folder for your project using the following command:
$ mkdir node-mysql-demo
Next, navigate inside your project’s folder and create a package.json file:
$ cd node-mysql-demo
$ npm init -y
This will create a package.json with default values.
Installing the MySQL Driver for Node.js
After creating a project, you can install the node-mysql2 module using npm:
$ npm install mysql2 --save
This will add node-mysql2 to the node_modules folder of your project (which will be created if it doesn’t exist) and add it to the dependencies section of the package.json file.
Connecting to your MySQL Database
In your project’s folder, create a server.js file. Open it and add the following code to import mysql2 and use it to create a connection to your MySQL server:
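A minimal sketch, with placeholder connection settings (adjust host, user, password, and database for your own server; the test database is created in the next step), could look like this:

// server.js
const mysql = require('mysql2');

// Placeholder connection settings.
const connection = mysql.createConnection({
  host: 'localhost',
  user: 'root',
  password: 'your_password',
  database: 'test'
});

connection.connect((err) => {
  if (err) throw err;
  console.log('Connected to MySQL!');
});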
Before doing anything else we need to create a database and a table. You can use the mysql command in your terminal to create a database. First run the following command in your terminal:
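For example, assuming you connect as the root user (adjust the user and host for your setup):

$ mysql -u root -p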
Next, run this SQL instruction:
create database test;
Next, you need to create a table by adding the following code:
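A minimal contacts table that matches the queries used below (the column names are assumed from those examples, not taken from the original snippet) could be created like this, either in the mysql client or via connection.query():

CREATE TABLE contacts (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100),
  email VARCHAR(100)
);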
Now that we have a database and a contacts table, let’s see how to perform CRUD operations using SQL queries.
Performing CRUD Operations
CRUD stands for create, read, update and delete and it refers to common operations that are used in most data-driven applications.
You create data in the database tables using the INSERT statement.
You read data from the database tables using the SELECT statement.
You update data in the database tables using the UPDATE statement.
You delete data from the database tables using the DELETE statement.
Creating/Inserting Data
connection.query('INSERT INTO contacts SET ?', { name: 'name 001', email: 'name001@email.com' }, (err, res) => { if (err) throw err; });
Reading/Selecting data
connection.query('SELECT * FROM contacts', (err, rows) => { if (err) throw err; console.log(rows); });
The rows variable contains the returned rows from the database table.
Updating Data
connection.query('UPDATE contacts SET email = ? WHERE id = ?', ['updated@email.com', 1], (err, result) => { if (err) throw err; });
Deleting Data
connection.query('DELETE FROM contacts WHERE id = ?', [1], (err, result) => { if (err) throw err; });
Conclusion
In this tutorial, you have seen how you can use the node-mysql2 driver to open connections to MySQL databases in your Node.js applications, and you created a simple CRUD example that demonstrates how to perform basic create, read, update, and delete operations via SQL SELECT, INSERT, UPDATE, and DELETE statements.
A version of this blog post first appeared in the developer-oriented website, The New Stack. It describes how MemSQL works with Apache Kafka to guarantee exactly-once semantics within a data stream.
Apache Kafka usage is becoming more and more widespread. As the amount of data that companies deal with explodes, and as demands on data continue to grow, Kafka serves a valuable purpose. This includes its use as a standardized messaging bus due to several key attributes.
One of the most important attributes of Kafka is its ability to support exactly-once semantics. With exactly-once semantics, you avoid losing data in transit, but you also avoid receiving the same data multiple times. This avoids problems such as a resend of an old database update overwriting a newer update that was processed successfully the first time.
However, because Kafka is used for messaging, it can’t keep the exactly-once promise on its own. Other components in the data stream have to cooperate – if a data store, for example, were to make the same update multiple times, it would violate the exactly-once promise of the Kafka stream as a whole.
MemSQL is fast, scalable, relational database software, with SQL support. MemSQL works in containers, virtual machines, and in multiple clouds – anywhere you can run Linux.
This is a novel combination of attributes: the scalability formerly available only with NoSQL, along with the power, compatibility, and usability of a relational, SQL database. This makes MemSQL a leading light in the NewSQL movement – along with Amazon Aurora, Google Spanner, and others.
The ability to combine scalable performance, ACID guarantees, and SQL access to data is relevant anywhere that people want to store, update, and analyze data, from a venerable on-premise transactional database to ephemeral workloads running in a microservices architecture.
NewSQL allows database users to gain both the main benefit of NoSQL – scalability across industry-standard servers – and the many benefits of traditional relational databases, which can be summarized as schema (structure) and SQL support.
In our role as NewSQL stalwarts, Apache Kafka is one of our favorite things. One of the main reasons is that Kafka, like MemSQL, supports exactly-once semantics. In fact, Kafka is somewhat famous for this, as shown in my favorite headline from The New Stack: Apache Kafka 1.0 Released Exactly Once.
What Is Exactly-Once?
To briefly describe exactly-once, it’s one of three alternatives for processing a stream event – or a database update:
At-most-once. This is the “fire and forget” of event processing. The initiator puts an event on the wire, or sends an update to a database, and doesn’t check whether it’s received or not. Some lower-value Internet of Things streams work this way, because updates are so voluminous, or may be of a type that won’t be missed much. (Though you’ll want an alert if updates stop completely.)
At-least-once. This is checking whether an event landed, but not making sure that it hasn’t landed multiple times. The initiator sends an event, waits for an acknowledgement, and resends if none is received. Sending is repeated until the sender gets an acknowledgement. However, the initiator doesn’t bother to check whether one or more of the non-acknowledged event(s) got processed, along with the final, acknowledged one that terminated the send attempts. (Think of adding the same record to a database multiple times; in some cases, this will cause problems, and in others, it won’t.)
Exactly-once. This is checking whether an event landed, and freezing and rolling back the system if it doesn’t. Then, the sender will resend and repeat until the event is accepted and acknowledged. When an event doesn’t make it (doesn’t get acknowledged), all the operators on the stream stop and roll back to a “known good” state. Then, processing is restarted. This cycle is repeated until the errant event is processed successfully.
MemSQL Pipelines provide exactly-once semantics when connected to the right message broker
How MemSQL Joins In with Pipelines
The availability of exactly-once semantics in Kafka gives an opportunity to other participants in the processing of streaming data, such as database makers, to support that capability in their software. MemSQL saw this early. The MemSQL Pipelines capability was first launched in the fall of 2016, as part of MemSQL 5.5; you can see a video here. There’s also more about the Pipelines feature in our documentation – original and updated. We also have specific documentation on connecting a Pipeline to Kafka.
The Pipelines feature basically hotwires the data transfer process, replacing the well-known ETL (Extract, Transform, and Load) process with a direct connection between the database and a data source. Some limited changes can be made to the data as it streams in, and it’s then loaded into the MemSQL database.
From the beginning, Pipelines have supported exactly-once semantics. When you connect a message broker with exactly-once semantics, such as Kafka, to MemSQL Pipelines, we support exactly-once semantics on database operations.
The key feature of a Pipeline is that it’s fast. That’s vital to exactly-once semantics, which represent a promise to back up and try again whenever an operation fails.
Like most things worth having in life, exactly-once semantics places certain demands on those who wish to benefit from them. Making the exactly-once promise make sense requires two things:
Having few operations fail.
Running each operation so fast that retries, when needed, are not too extensive or time-consuming.
If these two conditions are both met, you get the benefits of exactly-once semantics without a lot of performance overhead, even when a certain number of crashes occur. If either of these conditions is not met, the costs can start to outweigh the benefits.
MemSQL 5.5 met these challenges, and the Pipelines capability is popular with our customers. But to help people get the most out of it, we needed to widen the pipe. So, in the recent MemSQL 6.5 release, we announced Pipelines to stored procedures. This feature does what it says on the tin: you can write SQL code and attach it to a MemSQL Pipeline. Adding custom code greatly extends the transformation capability of Pipelines.
Stored procedures can both query MemSQL tables and insert into them, which means the feature is quite powerful. However, in order to meet the desiderata for exactly-once semantics, there are limitations on it. Stored procedures are MemSQL-specific; third-party libraries are not supported; and developers have to be thoughtful as to overall system throughput when using stored procedures.
Because MemSQL is SQL-compliant, stored procedures are written in standard ANSI SQL. And because MemSQL is very fast, developers can fit a lot of functionality into them, without disrupting exactly-once semantics.
Pipelines are Fast and Flexible
The Pipelines capability is not only fast – it’s also flexible, both on its own, and when used with other tools. That’s because more and more data processing components can support exactly-once semantics.
For instance, here are two ways to enrich a stream with outside data. The first is to create a stored procedure to do the work in MemSQL.
The following stored procedure uses an existing MemSQL table to join an incoming IP address batch with existing geospatial data about its location:
CREATE PROCEDURE proc(batch query(ip varchar, ...))
AS
BEGIN
  INSERT INTO t
    SELECT batch.*, ip_to_point_table.geopoint
    FROM batch
    JOIN ip_to_point_table
      ON ip_prefix(ip) = ip_to_point_table.ip;
END
(For a lot more on what you can do with stored procedures, see our documentation, which also describes how to add SSL and Kerberos to a Kafka pipeline.)
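For completeness, here is a sketch of how such a procedure might be attached to a Kafka pipeline (the pipeline name, broker address, and topic below are placeholders, not part of the original example):

-- Sketch only: stream a Kafka topic into the stored procedure defined above.
CREATE PIPELINE ip_enrichment_pipeline
AS LOAD DATA KAFKA 'kafka-broker:9092/ip-events'
INTO PROCEDURE proc;

START PIPELINE ip_enrichment_pipeline;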
You can also handle the transformation with Apache Spark, and you can do it in such a way as to support exactly-once semantics, as described in this article. (As the article’s author, Ji Zhang, puts it: “But surely knowing how to achieve exactly-once is a good chance of learning, and it’s a great fun.”)
Once Apache Spark has done its work, stream the results right on into MemSQL via Pipelines. (Which were not available when we first described using Kafka, Spark, and MemSQL to power a model city.)
Use Kafka, Spark, MemSQL Pipelines, and stored procedures for operational flexibility with exactly-once semantics
Try it Yourself
You can try all of this yourself, quickly and easily. MemSQL software is now available for free, with community support, up to a fairly powerful cluster. This allows you to develop, experiment, test, and even deploy for free. If you want to discuss a specific use case with us, contact MemSQL.