Hello everyone. In this talk I'm going to share my data journey from when I first started my career in 2016. I had only just graduated back then — a fresh graduate. I'll talk about the kinds of problems I ran into and how PostgreSQL saved me. Today's content: first I'll briefly recap who I was at that time — not who I am now, but who I was then. After that comes the problem I faced when I was handed my first job, my first assignment. Then the challenges that followed and how Postgres dealt with them. Finally, if there's time, I'll share one more challenge, a problem I keep running into in analytics systems — it's nothing more than a simple counting problem.

Right, part one: who was I back then? In 2016 I had just graduated, with a distinction degree in economics from the National Economics University. I had no background in IT, no background in technology or data processing at all. I simply felt that economics didn't really suit me, so I switched over to IT. My first job after leaving NEU was as a PHP developer at Appota, which at the time was a game publisher. OK — why do I mention this? Because it shapes the problem that comes later. I was fresh out of school, with no IT background and no data background, yet I was given a task that looks very simple: build a revenue reporting system for all of Appota's game products. At the time there were around 30 million users and, if I remember correctly, a few million transactions already. The requirement was straightforward: pull the data from Appota's payment systems, then build reports showing how much revenue we made today, how much yesterday. That's all. The hardest part was what I called Big Data at the time — honestly it wasn't that big, but to me back then, with a few million transactions, it was. So I researched some solutions and looked at how people around the world were processing large data sets. The first thing that caught my eye was the Hadoop ecosystem — Hadoop/HDFS, an ecosystem specialized in big-data processing, with its own solutions for storage, for processing, for querying and so on. But you can imagine: for a student fresh out of school with no IT background, that ecosystem hit me like a slap in the face. How on earth was I going to stand up a Hadoop system like that? So I started looking into other business-intelligence solutions. There are a number of tools, both open source and paid, such as Pentaho, Talend or Tableau — at the time those were the famous BI tools for building reporting systems.
But those tools felt too much like magic to me, because they hide all the processing behind the scenes, the way they organize the data underneath. When I looked at them I couldn't tell why they ran the way they did or why they produced the numbers they produced, and they have a huge number of features I would almost never use. Meanwhile I had exactly one wish: store a few million transactions and run calculations over those few million transactions. From a developer's point of view I felt I couldn't control any of it — I couldn't tell what errors there were or why the data came out the way it did. So I decided: OK, if I can't use a tool, I'll build it myself. In my view at the time, a reporting system with requirements that simple should be built to exactly those requirements. There are source databases containing the transactions; I wrote a worker to pull the data from those sources — really simple, a cron job on a schedule that calls the APIs and fetches the data. Then I used the database my company used most at the time: I stored everything in MySQL, and I built a dashboard that queries that MySQL instance to produce the reports.

The system worked, but it didn't live very long: after only four months I had to tear it down and rewrite it. Why? Because when I chose MySQL as the main database for my analytics, I overlooked that MySQL is designed for OLTP workloads — online transaction processing, the ordinary operational kind of system. What I needed was online analytical processing (OLAP): very complex queries for statistics, calculations, aggregation — and MySQL is not strong at that. MySQL supports only a handful of basic data types, things like int or varchar strings, whereas we needed fairly complex data — nested structures and custom data types. So there was a mismatch between the database I had chosen and what we actually wanted. On top of that, MySQL is designed for systems that read very fast, but only when a small number of rows is returned out of a large data set. If you want to aggregate — say, sum across a few million rows — it does very badly. That's why I had to tear that system down and rewrite it as version 2.
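To make the OLTP/OLAP distinction concrete, here is a minimal sketch of the kind of query the reports needed — a scan-and-aggregate over millions of rows rather than a point lookup. The table and column names are hypothetical, not taken from the original system.

-- Daily revenue over the last 30 days: cheap for an OLAP engine,
-- painful on an OLTP-oriented store once the table holds millions of rows.
SELECT date_trunc('day', paid_at) AS day,
       count(*)                   AS transactions,
       sum(amount)                AS revenue
FROM   payment_transactions
WHERE  paid_at >= now() - interval '30 days'
GROUP  BY 1
ORDER  BY 1;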
As I mentioned, the system was write-heavy: I was continuously pulling data in and updating the DB, and the write traffic was far larger than the traffic from users viewing reports. So I started thinking about something with better write performance, and that was MongoDB. I swapped MySQL out for MongoDB, and indeed the writes were very good: I could ingest data very fast and store a lot of it — from a few hundred gigabytes up to terabytes. But it has a rather annoying problem that comes straight from its design: it is a document store, and its model is denormalized. Because it's denormalized, we couldn't just query it on demand. It's still very fast if you fetch a few documents, because that only touches those documents, but if you want to, say, sum about a million documents together, it does quite badly. It does have the map-reduce mechanism, but its aggregation is, on the whole, still poor. So at this point I had to go in yet another direction: pre-compute nearly all of the metrics. I computed revenue per day, per hour, everything in advance, and stored the results ready-made so the pre-computed numbers could be served directly. That system also worked, and it kept up with our data volume for about two months. Then the business requirements got more complicated: people said, when I open the dashboard I want to run custom queries, I don't just want to look at numbers that were pre-computed like this. At that point we had to tear it down again — the third rewrite, six months in.

This time a friend of my boss advised me: you could use Postgres. When I started working with Postgres, I reorganized the storage layer and made a distinction between two kinds of database in the system. The first I call the data lake: when you pull data from many different sources, you need one common landing place, and data from different sources comes in different formats with different kinds of errors. That common store is better off being a NoSQL database, because if you put an SQL database there you have to define a schema for it, and every time the source data changes you have to change your schema — laborious and error-prone. So I kept MongoDB there to collect all the data; I call the process of pulling the raw data into MongoDB the first ETL. Everything related to that data gets pushed into MongoDB as the shared store. Then I added another database, the data warehouse, dedicated to querying — mainly OLAP, analytics-style queries. A second worker moves the data from MongoDB, the shared store, into Postgres, the query store. That worker's job is to take the unstructured, inconsistent data and essentially pre-process it into structured data that we can query quickly. OK — that was my third architecture, built around Postgres, and it turned out to fit the OLAP workload well.
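As an illustration of that second ETL step — shaping loosely structured records into a query-friendly fact table — here is a hedged sketch in SQL. In the real system the worker read from MongoDB; the staging table, fact table and every column name below are hypothetical.

-- Loosely structured records landed from the lake (one JSONB blob per record).
CREATE TABLE staging_transactions (
    raw jsonb NOT NULL
);

-- Structured, query-friendly fact table in the warehouse.
CREATE TABLE fact_transactions (
    tx_id   bigint PRIMARY KEY,
    user_id bigint NOT NULL,
    game_id int    NOT NULL,
    amount  numeric(12,2) NOT NULL,
    paid_at timestamptz NOT NULL,
    meta    jsonb            -- whatever does not fit the fixed columns
);

-- The transform: pull typed columns out of the blob, keep the rest as metadata.
INSERT INTO fact_transactions (tx_id, user_id, game_id, amount, paid_at, meta)
SELECT (raw->>'id')::bigint,
       (raw->>'user_id')::bigint,
       (raw->>'game_id')::int,
       (raw->>'amount')::numeric,
       (raw->>'paid_at')::timestamptz,
       raw - ARRAY['id','user_id','game_id','amount','paid_at']
FROM   staging_transactions
ON CONFLICT (tx_id) DO NOTHING;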
Why did it fit? First, Postgres supports a lot of custom data types. Once you work on OLAP you'll find you need plenty of values that aren't plain numbers or ordinary strings: they can be binary values with structured data inside, they can be JSON, and so on. Second, Postgres aggregates very fast — in a single query I could aggregate across a million rows in just milliseconds. It has many features aimed at OLAP, like window functions and materialized views. And it is very easy to extend, through the PL/pgSQL procedural language and through extensions — you'll see the benefit of extensions later in this talk, where I used a few of them to handle the counting problem.

However, it also had some downsides that I ran into in operation. Our Postgres depended on vertical scaling: to give the Postgres node more processing power, the only option was to upgrade the hardware — more CPU, more RAM — because it is a centralized system that can't be spread across many servers. You can scale reads with replicas, but in general you still end up adding CPU and memory. Second, because it has so many advanced features, the learning curve is steep if you want to use it well — it took me a good six months to finish this third system, just sitting there optimizing queries and reports. Another point: Postgres's connection handling is quite poor — at least I found it poor, and it supports few connections — so pretty much everyone ends up putting an external connection pooler in front of it, and setting that up adds complexity to the system. But here is the most annoying part of working with Postgres back then, on version 9.5: in 2016 it was genuinely hard to set up a proper Postgres cluster. I had to read a series of six blog posts just to set up Postgres with pgpool — six posts, each one as long as a scroll. Very hard, nothing like today; nowadays there are plenty of solutions for setting up a Postgres cluster, like Patroni, but back then there weren't. For a junior developer like me, at a company with no prior experience either, having to set up a whole Postgres cluster took a lot of time. Also, in version 9, partitioning wasn't built in yet; you had to use an external extension and manage it by hand, which was exhausting — from version 10 it's built in. The most painful thing was upgrading versions: around version 9.x Postgres only had physical replication, so to upgrade a running cluster you basically had to shut it down completely — you couldn't upgrade online. Of course later versions have improved a lot, but at the time those were the biggest pains.
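Before moving on, a minimal sketch of the OLAP-style features mentioned above — a window function and a materialized view — on the hypothetical fact table from the earlier sketch:

-- Daily revenue plus a 7-day moving average, using a window function.
SELECT day,
       revenue,
       avg(revenue) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
           AS revenue_7d_avg
FROM (
    SELECT date_trunc('day', paid_at) AS day, sum(amount) AS revenue
    FROM   fact_transactions
    GROUP  BY 1
) d
ORDER BY day;

-- The same daily rollup kept as a materialized view, so dashboards do not
-- re-aggregate millions of rows on every page load.
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT date_trunc('day', paid_at) AS day, sum(amount) AS revenue
FROM   fact_transactions
GROUP  BY 1;

REFRESH MATERIALIZED VIEW daily_revenue;  -- e.g. on the ETL worker's schedule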
Now let me share a problem that we very often had to solve when building data processing systems; it relates to nothing more than counting. The counting problem has a very simple requirement. The data, in simplified form, has essentially just two important fields: a user ID and the time the user was online, or appeared. From that we have to count NRU — newly registered users — how many users are active daily, weekly and monthly, and the return rate: how many of today's users were also here yesterday. The data is that simple, the metrics are that simple; conceptually there is nothing hard here. The real problem is the data size. Our system had 30 million users, and Appota's games had a lot of users online — probably a few hundred thousand up to a million every day. When the data size is that large, even operations as simple as counting take time, and the performance is very poor. And because the data size is large, intersecting and unioning sets of users — which is what you do to compute the return rate — becomes hard, because it is heavy: if each set has 100,000 elements, intersecting two of them is slow. As an example, I inserted roughly 3 million rows of the sample data generated in the earlier part, and it takes about 100 MB. You might say 3 million rows and 100 MB is nothing, but that is only sample data; the real data carries a lot of extra metadata — which game the record belongs to, which device it was accessed from — so in practice it grows many times larger. That is the problem.

The first approach I could think of comes from noticing that the original data contains a lot of repetition — the date column, for example: for two users we spend two rows to store user ID plus date, even though the date part is repeated. So my first solution was to use an integer array: gather all of a day's users into one array. That is a built-in data type of Postgres. It brought the data down to about 10 MB, and counting became faster — counting the number of elements in one array is very quick — but intersect and union between these sets were still slow. After that I looked into another, approximate, data type: HyperLogLog. With it the storage shrinks dramatically — each set comes down to only a couple of kilobytes. The catch is that this is an approximation algorithm: the numbers it gives are not exact, they carry an error, and we usually accept an error of around 2%. You can get this through the extension from Citus. Estimating the cardinality of a set this way is extremely fast, simply because the structure is so small, but it only supports counting: you can no longer know exactly which elements are in the set, because it only stores an estimation structure. Those were the two approaches I used at the time.
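A hedged sketch of those two approaches, assuming the postgresql-hll extension (the Citus-maintained one) is installed; the table and column names are hypothetical and the exact hll function names may differ between versions:

-- (1) One integer array of user ids per day instead of one row per (user, day).
CREATE TABLE daily_users (
    day   date PRIMARY KEY,
    users int[] NOT NULL
);
SELECT day, cardinality(users) AS dau FROM daily_users;  -- exact, but set operations stay heavy

-- (2) Approximate counting with HyperLogLog: tiny sketches, roughly 2% error.
CREATE EXTENSION IF NOT EXISTS hll;

CREATE TABLE daily_users_hll (
    day   date PRIMARY KEY,
    users hll NOT NULL
);

INSERT INTO daily_users_hll
SELECT online_at::date, hll_add_agg(hll_hash_bigint(user_id))  -- assumes numeric user ids
FROM   user_activity
GROUP  BY 1;

SELECT day, hll_cardinality(users) AS estimated_dau FROM daily_users_hll;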
Nowadays, however, there is one more data type: the bitmap, and roaring bitmap, which is a compressed form of bitmap. The mindset behind storing a bitmap is quite different: you keep a sequence of bits, 0s and 1s, and when some user ID is present in the set you set the bit at that position to 1. The nice thing is that it is small as well — my whole 100 MB of data came down to about 2 MB — and it gives extremely fast intersect and union, because you are just combining bit strings, which is very cheap to process. The downside is that it only supports numbers: you can't put string user IDs in there; only numeric user IDs make this trick possible. So those are the three approaches I've used. Back then there was no roaring bitmap extension — I was planning to write one, but gave up on the idea; now one exists, it runs very stably, and I've tried it. OK, that is the whole of my presentation on building an analytics system. Does anyone have any questions?

Question: on the market there are lots of options for a data analytics / data warehouse system, and one famous one is ClickHouse — why didn't you choose ClickHouse? OK, let me answer right away. In 2016 ClickHouse had only just been publicly released; it was nowhere near as popular as it is now. Today, sure, I could choose ClickHouse, but at that time it was brand new — I simply didn't know about it, and when I asked people in the industry here in Vietnam, nobody seemed to know it either. You can look at the history: it had been developed earlier, but only around 2016 was it made public, or had some big public release, so there were really very few users. And secondly: even now, having designed the analytics system for my current company, I still use Postgres. Why? Setting up ClickHouse involves a lot of components and is quite complex — it does solve a lot of problems, but you have to put real effort into setting it up and maintaining it. It's a trade-off. I prefer to keep things simple: if the problem can still be handled with tools the team already uses regularly and is familiar with, I'll use those rather than bring in a completely new tool.

Question: you mentioned the performance of Postgres's aggregate functions — why are Postgres's aggregate functions better than MySQL's? Honestly, I don't know. I only benchmarked it; as for what it does underneath and why aggregation is that fast, I really couldn't say.

Question: I'm also very interested in Postgres, since my company uses it heavily, and you mentioned a weak point of Postgres — connection pooling. Could you share more about that weakness? OK. Out of the box, Postgres supports a fairly small number of connections — the initial limit is only around 50 or 100, I don't remember exactly, which is very low — and simply raising it can overload the database, because internally, for each incoming connection, Postgres essentially forks a separate backend with its own resources to handle it, which is heavy.
So even if you raise that limit, with lots of connections coming in, Postgres — the database — will still fall over. That is why you need a connection pooler standing in front, something like PgBouncer, so that the per-connection resources of the DB get reused and the queries from many client connections are funneled through a small pool. That's why connection poolers are needed.

Question: we're currently hitting this problem — when a lot of connections query the server, Postgres ends up hitting the database repeatedly to fetch the parameters of stored procedures, which costs a lot of server performance. Do you have a suggestion for this? As I said, you can use a connection pooler; a very well-known one is PgBouncer. PgBouncer has several pooling modes — pooling per connection, per client session, or even per individual query — it's a tool that specializes in pooling. I also used pgpool-II before; that one additionally does load splitting across replicas, so read queries are spread over the replicas while write queries go only to the master. Those are the two solutions I've used. If you don't need the load splitting, PgBouncer is enough — it's the one that specializes in pooling.

Question: we have a problem that looks similar to yours. We compute revenue by day, by month and by quarter. Before we knew about ClickHouse we also used Postgres, and the approach was: since historical revenue rarely changes, we pre-computed everything by month; then there were adjustments by month, by quarter, by year, so we only compute today's revenue on the fly and pull the pre-computed figures for whichever month we need... I think I'll answer you separately afterwards, since your situation is a bit different. OK, thank you all.

Yeah, thanks to the speaker. We have to stop now because the next talk is announced and ready. Is the speaker for the next talk available? Yes, please. We recently lost a well-known person — let's quickly pray for Simon Riggs, who was a great man of Postgres; may he rest in peace.

Xin chào, nice to meet you. I feel honored to be here. I want to thank the organizers of PGDay Asia for providing me with the opportunity to give this presentation. The title of my presentation is "Parallel loading of a single copy data file using pgm copy". I am very nervous, sorry. This presentation explains various tips, and the result of my investigation into how to load a very large single-file data set into the database as quickly as possible. Let's begin. I am Joseph Kim from South Korea; I arrived yesterday with my co-workers. My Korean name is Kim Sang Gi; Joseph is my Catholic name. I've been using Postgres in production since 2005 — a very long time; come to think of it, Postgres has improved a lot since that time. I am leading PostgreSQL DBAs at KTDS. KTDS is Korea Telecom Data System, a subsidiary of KT, Korea Telecom. My duty is governance of the stable version of PostgreSQL used at KT; I guide DBA operations for Postgres, and I also guide development teams on how to use Postgres in their projects.
I also search for the root causes of incidents and problems, and then I find workarounds. I formed the Korean PostgreSQL user group in 2005, and I have been a member of the committee of the PGDay Seoul event since 2016. PGDay Seoul will be held again this year — it will be the 9th edition. I hope to see you in Seoul. This is my email.

Here are the contents. I will talk about the history of parallelism in PostgreSQL; next I will describe the sample data used for importing and exporting; then I change the server parameters for that work; I explain that the copy command has a limitation; I made pgm copy and tested it; I will explain how to improve create index; and lastly I would like to offer a suggestion to the PGDG, the PostgreSQL Global Development Group, for parallel copy processing.

This is the history. In versions 9.6 and 10, parallelization of basic select queries was introduced; that was the beginning. Version 11 parallelized DDL commands like CREATE INDEX and ALTER TABLE ... ADD PRIMARY KEY. In version 12, client programs became able to process multiple tables in parallel, which reduced DBAs' working hours. In version 13, vacuum operations were parallelized; the important thing here is that when a table has multiple indexes, Postgres parallelizes each index's cleaning very well, but it still cannot parallelize the cleaning of a single table by itself. Version 14 enables parallelization in multi-instance environments, which was a cornerstone for implementing sharding clusters. In version 15, Postgres became able to parallelize SELECT DISTINCT. Version 16 can parallelize FULL JOIN and RIGHT JOIN queries. However, it still cannot parallelize the COPY command, and this is a disadvantage in many ways, especially if you have a lot of data to load into, or dump from, a single table. A parallel COPY command was discussed at length by the PostgreSQL Global Development Group in 2020, but it has not been officially released yet. My talk today is about exploring the parallelizability of a very simple copy operation.

I will explain it with sample data. In Korea, the government releases hospital medical history information every year; anyone can look at a year's medical history of a million anonymous patients. One year of data is about 12 million rows and the table size is 1.5 gigabytes. To handle a larger amount of data I targeted 3 years of it, so the total is about 37 million rows and roughly 4.5 gigabytes. First I copied the table to a file: it took 22 seconds. I checked the size of that dump file: about 3.2 gigabytes. Next I truncated the table for importing again, and then I copied the table back in from the file: it took 1 minute. Importing took roughly 3 times longer than exporting. The reason is that the server has to write the write-ahead log and the data pages to disk, so it is necessary to configure transaction-log writing so that it interferes with data input as little as possible. I set fsync off, synchronous_commit off and full_page_writes off, made checkpoint_timeout as long as possible and the transaction log size (max_wal_size) as large as possible, and then reloaded the configuration. Let's check again: before changing the configuration, 1 minute; after changing, 46 seconds — about 25% faster.
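A sketch of those configuration changes as they might be applied with ALTER SYSTEM plus a reload. The specific values for checkpoint_timeout and max_wal_size are illustrative (the talk only says "as long/large as possible"), and these crash-unsafe settings are meant for a bulk-load window only:

ALTER SYSTEM SET fsync = off;                   -- unsafe: only for the load window
ALTER SYSTEM SET synchronous_commit = off;
ALTER SYSTEM SET full_page_writes = off;
ALTER SYSTEM SET checkpoint_timeout = '60min';  -- "as long as possible"
ALTER SYSTEM SET max_wal_size = '64GB';         -- "as large as possible"
SELECT pg_reload_conf();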
I wanted to know if there was any way to make it even faster — I still wonder — but here is something I came up with: wouldn't it be faster to (a) create the table as an unlogged table, (b) input the data, and (c) turn it back into a logged table? You can think about it, but the time to change an unlogged table into a logged table is proportional to the amount of data; in other words, on top of inputting the data, it takes additional time to change the table's property.

The backslash-copy command in psql internally streams the data through a pipe; therefore it cannot be parallelized without pre-processing the file into parts. To work with that, I needed to know the file format of the output of the copy operation. In the file created by COPY TO, each line is a single row; each column is separated by a tab character; and characters like newline and tab inside column values are replaced with their respective escape sequences. The simplest way to import this file in parallel is to break it up into multiple tasks: split it into 1/N-sized files based on the total number of lines, and load each of those files with its own psql copy command. You can parallelize the copy command this way by using the split command, a Unix command that splits a large file into multiple smaller files. However, the file splitting operation itself takes time; furthermore, the split process reads the file sequentially and spreads it into multiple files, so we cannot reduce the time needed to read the original file sequentially from beginning to end.

Here is the second case. To solve this problem I made a new program. The program creates as many threads as the degree of parallelism, and each thread reads and imports only the portion of the file it needs to read: pgm copy. pgm copy is a multi-threaded copy program for PostgreSQL. This program is a prototype — don't use it in a production environment. The pseudocode is simple, right? But the chunk end points are not exactly at 1/N of the file size. This is because each row of the copy data has a variable length, so we need to adjust the end of the chunk to be processed, and then adjust the next start position accordingly. It creates enough threads to parallelize the operation, and with these adjusted start and end positions each thread reads its assigned range from the file and performs the copy. This is the data flow in pgm copy, from right to left. I asked ChatGPT to write this in C code, and the resulting program is pgm copy. Of course it didn't work as intended at the first attempt: ChatGPT wrote about 95% of the code, and I did some work on debugging and exception handling to make pgm copy use memory efficiently. Here is the execution history: the final run took 25 seconds — twice as fast as psql. I also did a speed comparison based on the number of threads; this is the comparison chart. A single thread took 36 seconds, while 16 threads took 14 seconds. The test machine is as shown — it's very fast. As you can see from the chart, when I loaded the data with multiple threads the loading time was shorter even though it was a single table; compared to a single psql copy operation, it was more than 4 times faster.

So far we haven't considered the table's primary key index. In production it is extremely rare to use a plain table with no constraints or indexes like that. So let's think about how to input data in a production environment: we need to do three pieces of work. The first is an SQL script that creates the table with no constraints; the second is to use pgm copy; and the last is a script that creates the constraints and indexes of the table. It's quite annoying.
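A minimal sketch of that three-step production flow; the table, its columns and the index are hypothetical:

-- Step 1: create the table with no constraints or indexes.
CREATE TABLE medical_history (
    record_id  bigint,
    patient_id bigint,
    visit_date date,
    diagnosis  text
);

-- Step 2: load the data in parallel with the external pgm copy program
-- (run from the shell, one connection per thread).

-- Step 3: add constraints and indexes after the data is loaded.
ALTER TABLE medical_history ADD PRIMARY KEY (record_id);
CREATE INDEX ON medical_history (patient_id, visit_date);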
Also, when you do that ALTER TABLE or CREATE INDEX, maintenance_work_mem and max_parallel_maintenance_workers should be tuned appropriately to speed things up. The defaults — 64 megabytes of memory and 2 workers — are too small in a typical enterprise database situation, especially for a large-table maintenance task like this. I changed them to 6 gigabytes and 6 workers, and the index build went from 3 minutes down to 2 minutes, because PostgreSQL has the ability to create indexes in parallel.

To conclude my presentation, I'd like to make the following suggestion to the PostgreSQL developers: parallelization of the COPY command is a must. If it cannot be implemented on the server side, I would like to see it supported in client programs such as psql, pg_dump and pg_restore. Plan 1 is for psql: create a new built-in meta-command along the lines of pgm copy — the command above is for output, and the one below is for input. A parallel pg_restore or pg_dump program would also be a lovely thing. That's all I have prepared for you today. If you have any questions, please email me. Thank you for your patience.

Do we have any questions? A simple question, I didn't understand: do the threads use different connections, or do multiple threads share one connection? Each thread uses its own separate connection. OK, and taking into account your long experience, does it make sense to use threads inside the backend — for example, implementing plan nodes such as parallel scan or parallel join with threads? Does it make sense, or is it senseless?

Hello everybody, my name is Ilya. I'd like to share with you my new experimental extension for Postgres, pg_stat_advisor, which can manage your statistics in Postgres. My presentation consists of these topics: I'll tell you about basic and extended statistics, how you can update them, and especially about my new extension, the statistics advisor. So let's start with the autovacuum daemon. Perhaps everyone knows about it, but I'd like to revisit this daemon in the context of statistics. All the features of autovacuum you can see in this table, but in the context of statistics, you probably know that you can update the statistics of your table with the ANALYZE command; with VACUUM you can also update some attributes in the pg_class table, which is stored in the system catalog. Which attributes can vacuum update? It can update reltuples, which means the number of rows in the table; relpages, the number of pages in which the table's data is stored; and relallvisible, the number of pages that all transactions can see. Vacuum can update all of these, but the ANALYZE command cannot update relallvisible, so if you want to update that attribute you have to execute the VACUUM command. If you want to manage autovacuum in the context of statistics, you can use these autovacuum GUC parameters: autovacuum_analyze_scale_factor and autovacuum_analyze_threshold. Once enough inserts, updates or deletes have accumulated on your table, an automatic analyze is executed.

Let's look at my new experimental extension on a simple example. We create a table with two columns, but for this table we turn autovacuum off — you can do that with this command. We created the table, and now we'd like to insert rows into it; but in order not to insert trivial rows, we fill this table with low-cardinality data — cardinality here meaning the ratio of the number of unique rows to the total number of rows in the table.
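A sketch of that demo setup — a two-column table with autovacuum disabled, filled with low-cardinality data. The table name, column names and row count are illustrative, not taken from the talk's slides:

CREATE TABLE t (a int, b int) WITH (autovacuum_enabled = off);

-- Few distinct values relative to the row count; b is functionally dependent on a.
INSERT INTO t
SELECT i % 100, (i % 100) % 10
FROM   generate_series(1, 1000000) AS i;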
We insert the data with the next command. What next? As I said, pg_class stores these three attributes, and as you can see there is no data in them yet, because we haven't collected statistics on this table. So, as I said, if you want to update the statistics data for your table, you have to execute the ANALYZE command. For the reltuples attribute, a value of -1 means that statistics have never been collected at all. After the ANALYZE command you can see that reltuples and relpages are updated, but relallvisible is not; after a VACUUM command that attribute is filled in as well.

We have another parameter for managing the collection of statistics: default_statistics_target. This GUC parameter controls the level of detail of the collected statistics: if you set a large value, a large amount of statistics data will be collected, and you should know that if you collect a lot of statistics, the ANALYZE command will run more slowly and can be less effective.

We've looked at the basic statistics. For all tables and all columns of a table, the statistics in pg_stats include: null_frac, the fraction of null values among the rows; avg_width, the average width of the column's values; n_distinct, the number of distinct values in the column; most_common_vals and most_common_freqs, the most common values and their frequencies; and histogram_bounds — as you can see in this example, the length of histogram_bounds is almost the same as default_statistics_target.

OK, let's move on to extended statistics. As I said, you can collect statistics on a single column, but if you have functional dependencies between two columns of a table, or you want to consider statistics across two or more columns, you should use extended statistics. There are three kinds: dependencies, ndistinct and most common values (MCV). You can create extended statistics like this. What is it for? A quick note: if you execute the CREATE STATISTICS command, the statistics object is stored in the pg_statistic_ext catalog, but there is no data in it yet. Why? Because you also have to execute the ANALYZE command; if you don't, you have no data, but after doing so you can see that the extended statistics object contains data. Here we created statistics of the dependencies kind, and as you can see, the first column depends on the second.

OK, we've seen a simple example of creating extended statistics; now let's drop that and consider why we use extended statistics at all. I guess you know the planner needs statistics about the table, and if we point out that there are dependencies between two columns, we can help the planner to be more accurate. As you can see, the estimated rows show how the planner's prediction compares with reality: right now the planner is predicting the wrong number of rows, but after creating extended statistics — and don't forget to execute ANALYZE — our estimated rows become similar to the actual rows.
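A minimal sketch of that step on the hypothetical table t from the earlier sketch:

EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;  -- rows badly mis-estimated

CREATE STATISTICS s_t_ab (dependencies) ON a, b FROM t;
ANALYZE t;  -- without ANALYZE the statistics object stays empty

EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;  -- estimate now close to actual

-- The collected data can be inspected through the pg_stats_ext view.
SELECT statistics_name, dependencies FROM pg_stats_ext WHERE tablename = 't';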
OK, how is this related to my new experimental extension? I wanted it to suggest, or advise, commands that can improve the planner — especially the estimated rows — and sometimes to suggest executing the ANALYZE command when your statistics are stale after some big burst of inserted rows. That's why I wrote the statistics advisor extension. What does this extension consist of? It is a simple hook that fires on the EXPLAIN command, and you can enable the extension either with the LOAD command or by adding it to the main configuration file, postgresql.conf. The extension has two main GUC parameters: the first, an analyze scale factor, controls when my extension suggests executing the ANALYZE command — how it works I show on the following slide — and the second, a suggest-statistics threshold, helps it suggest creating extended statistics.

Let's go back to our simple example: drop the statistics, insert the same data we inserted earlier, and consider the first GUC parameter, the analyze scale factor. What happens when we set this parameter? It does the following. You probably know that Postgres has a view called pg_stat_all_tables that contains information about your tables, including autovacuum information, and we can use this information: n_mod_since_analyze shows how many rows have not yet been processed by the ANALYZE command, and n_live_tup is the number of live rows visible to transactions. We divide one by the other, and if the ratio is greater than our GUC parameter, then after executing EXPLAIN ANALYZE on the SELECT we looked at on the previous slides, the planner gives us a message that we should execute ANALYZE — and of course the estimated rows are low compared to the actual rows. So the message appears and you can run that command. After that, the next GUC parameter of my extension looks at the ratio of estimated rows to actual rows: if it is less than my GUC parameter, the next suggestion message says that you can create extended statistics. The main feature of this extension is that it does not execute these commands for you; it only suggests executing them, and you don't have to. So, because our estimated rows divided by the actual rows is less than the GUC parameter, this notice appears; and after creating the statistics — and of course running ANALYZE — there are no more suggestion messages, and our estimated rows are similar to the actual rows.

That's all I wanted to say. These are the QR codes to my social media and to this presentation, and the source code of my extension is on my GitHub page. If you have questions, perhaps I can answer them. OK, that's all — any questions from the audience? Nice presentation, Ilya — what's the roadmap for the extension, the future roadmap, what are the next features going into it? Ah, the features — I showed this slide because I wanted to tell you about the main features of autovacuum and EXPLAIN in the context of statistics, only statistics; these are only the main features, and of course there are a lot more features in autovacuum. I think that's all, thank you. Any further questions? Yes, thanks — do you try to avoid false-positive recommendations, or have you perhaps solved this problem somehow? What I mean is: sometimes in Postgres we have a lot of optimizations where we don't read tables entirely — for example, if one side of a join is empty you don't read the other, or with a merge join you don't need to read all of the data. Good question — I had forgotten to mention this. My extension currently works only with plain SELECT commands with a WHERE filter, not with joins or merge joins — I am still developing that. If you have joins, it doesn't provide any information, and unfortunately it can recommend useless messages; I'm trying to improve it. Any further questions?
If there are none from the audience, then let's thank the speaker again. Yes, thank you. We have a break now until 11 o'clock; please take the time to get a coffee.

We are continuing our PGDay with the next talk. The speaker is going to introduce himself, but this is the keynote, so thank you for coming, and I'll pass the microphone to Gary.

Thanks. It's great to see so many interesting talks at the conference here, so I really appreciate you coming along to this one. When I was asked to do the keynote, it was a little challenging to choose a subject, and I thought: at these conferences it's very easy to end up getting so involved technically that you lose sight of the forest for the trees. So taking that into account, and the number of students here — it's great to see so many students at this event embarking on their journey in technology — I thought it would be good to talk about a couple of topics. The first is what you can get out of being a member of the community. Quite often we talk about this from the perspective of the technological innovation generated by open source, and we also talk about what members mean to their community when they participate, but we don't often talk about what you can personally get out of being a member of a community. So I thought that would be a really good thing to talk about, with so many people coming out of university and perhaps looking to join a community.

My name is Gary Evans, as he said. I've been working at Fujitsu for a number of years, on projects that range from the very commercial to the very open, and that's given me a bit of a unique perspective over the years. So in the second part of the talk I'll cover the symbiotic relationship between open source and the commercial organisations that rely on those projects. I think over the last few years there's been a change in that relationship — the dynamics have shifted a bit — and I think that comes down to one factor in particular, so I'm going to talk about what that factor is, how it changes the relationship, and what that means to you and your involvement in a project.

I thought I'd begin by telling a little story from my book of Data Tales that's about PostgreSQL, so I'll read it out. Once upon a time, in the vast land of database kingdoms, there lived a wise old database called PostgreSQL. PostgreSQL was known throughout the lands for being a reliable, powerful database and, most importantly, for its generous spirit of open source. One day a young developer embarked on a quest to find a database suitable for the project he intended to build. He traveled through the valleys of Oracle, across the plains of MySQL and over the hills of MongoDB. Each database offered him promises and gifts of speed, scalability and simplicity, yet the developer felt something was missing. He finally arrived at the gates of PostgreSQL. "Why should I choose you, PostgreSQL, over the others?" he asked. And PostgreSQL, with a twinkle in its eye, said: "Well, young developer, unlike the others I offer you not just performance but wisdom, not just transactions but trust, and not just data but a community. For every query you ask I shall provide an answer, and for every problem a solution. But most importantly, I offer you freedom, for I am open source." The developer, intrigued, decided to give it a try, and as he worked with PostgreSQL he discovered its true power — not just in its features but in the community around it. Impressed and
inspired, he knew he had found his database. Years passed, his projects were very successful, and one day he was at a FOSSASIA conference and someone asked him, "Why did you choose PostgreSQL for your project?" And he said, "Well, PostgreSQL didn't just store my data — it captured my heart."

I like the story and wanted to present it because I think it encapsulates the essence of PostgreSQL, and probably of other open source projects as well: it's not just a reliable, powerful database, it's a community-driven project that values collaboration and freedom, and that's really important. Open source projects thrive on their communities, and I think PostgreSQL's success comes from the strong community out there; the fact that we have a PGDay sub-conference event here is a real testament to the community behind PostgreSQL. Community and collaboration are core values of open source projects and key to their success — I don't think I've ever seen an open source project that was successful without a good community behind it. The collaboration and the relationships you make at this conference are what will build the success of the projects we see here, and of the emerging projects we'll see next year and the year after at this event.

The concept of open source as a driving force of technical innovation — I didn't want to dwell on that today, but if we think about it there are some very notable characters from various eras, going back to the 1980s with Richard Stallman and the Free Software Foundation, and Eric Raymond, who wrote the essay "The Cathedral and the Bazaar" — it must be getting on for 20-25 years since I read that, so I don't know whether the new generation coming through is familiar with it, but it's certainly worth a read to understand what open source is all about.

So rather than talk about technical advancement or what you can do for your community, I'm just going to talk a little about what you can get out of being a member of an open source project. The flow of knowledge, skills and ideas also provides members with opportunities for personal growth. Leading up to this conference I was thinking about what motivates people to be members of open source communities, and I came up with five, so I thought we'd do a little audience participation and get a feel for the big motivators for people here to be a member of a community, or even to think about joining one.

The first one I came up with was professional growth and development. Joining a community gives you the opportunity to work with cutting-edge technologies, to learn about new frameworks, new languages and the best practices associated with them. It also gives you the opportunity to learn soft skills like leadership and effective communication. That's the first one. The second is networking and community engagement: some people love networking opportunities and connecting with like-minded people from all over the world, and I think it also provides opportunities to advance your career. There's also the warmer, fuzzier side of giving people a sense of belonging and identity. So that was number two — have a think about whether it applies to you. Personal fulfillment and empowerment was the third one: developers love to express creativity and innovate, that's what they do best. I'm
not really a developer anymore — I do more management activities — but I dabble now and then, and the other day I wrote a bit of code using combinations of regular expressions and some of the generated-column capability in PostgreSQL, and it was pretty cool; I really enjoyed it. So I can really appreciate the buzz that members of the community — obviously far smarter than I am — get out of the contributions they make to these projects, and we've seen some great examples of that at this conference; there's been some really cool stuff, and I look forward to seeing more. Fourthly, there's the feeling of contributing to the greater good: there are a lot of projects out there you can belong to where you can feel you're making a real-life impact on society and on different communities. A couple that come to mind are the OpenMRS medical records project and ACFO, which provides management tools for transparency to help avoid fraud, plus humanitarian projects — and I'm sure there are plenty more of those if that's a motivator for getting involved. And lastly, money: of course there are a lot of organizations and sponsors here, like AWS and Fujitsu itself — we pay 7 or 8 developers to work full time just on the open source PostgreSQL project; recently they've been doing the logical replication features, which are going to be talked about next.

So those are the five I thought of, and I thought we'd do a show of hands for each one and see which ones might be a motivator for you, because to be honest I've got no idea — I only know what applies to me. So I'll read through each one individually; please put your hands up — I'll be very disappointed if no one does. Professional growth and development: who thinks that would be a good motivator for them to join a community? OK, I've got a few, hopefully. Networking and community engagement — is that the same people? That's great. And personal fulfillment and empowerment — looks like all of these are equal motivators. I suspect maybe everyone's going to put their hand up for money, but contribution to the greater good — who feels that? OK, there's a few there. And money? OK. I've got to be honest, I didn't learn anything there, but it was fun to see — thanks for your participation.

The second thing I wanted to talk about is a change I've noticed over the last few years in my role. I was looking through some diagrams — ecosystem diagrams, product ecosystem maps, product landscape maps, even technology landscapes — and they've grown significantly: there are a lot more organizations and products appearing on these maps, with links between them, which is quite significant. Diving deeper into the landscape map, there's an emerging trend where commercial companies aren't just looking for ownership of particular products and selling them; they're actually embedding themselves very deeply inside partner ecosystems. Looking at some recent software releases, there are fewer new features and more integrations with other products out there, enabling companies to embed themselves within the ecosystem, where they can provide customers with a product that meets their needs much better across a greater landscape of things. So it's really about amplifying the value to the end customer, with all these different companies contributing their unique strengths and together coming up with a sum that's greater than
all the parts that make it up. These integrated solutions — I guess what I'm trying to say is that they address a wider range of customer needs. I'll give you some examples: have you noticed how many large tech companies have been investing in AI startups lately? A lot of cloud service providers are coming together, and even cybersecurity companies, large financial companies and large tech companies are buying them up and partnering with them. I think I read last week that Amazon made a $2.7 billion investment in the AI startup Anthropic — that was interesting to see — and there must be about 60 or 70 AI startups I was reading about the other day; interest in them is really exploding. I guess the reason is that it enables these companies to offer very comprehensive solutions that meet the complex business needs of today, and it allows a seamless experience for customers, integrating everything from data analytics to storage solutions so that it fits neatly into customers' existing infrastructure and meets the needs they have.

And it's not just commercial companies that benefit; it benefits open source organizations as well. PostgreSQL has really benefited from this over the last few years: we see PostgreSQL in a range of cloud services — RDS, Google Cloud SQL and Azure Database for PostgreSQL — and those organizations add a lot of other services around it, making the PostgreSQL technology much more accessible, easier to use and more valuable. We also see integration with development environments: DevOps practices and their tools encourage tight integration with databases — with Kubernetes, for example, Crunchy Data and Fujitsu both have very good operators for running PostgreSQL in a containerized environment, simplifying things and making it much easier to run Postgres there, and that aligns databases with the architectures we're seeing, like microservices and CI/CD continuous-deployment pipelines. Another example is data-driven decision-making tools: Fujitsu is very active in data-federation-type services, with its foreign data wrappers to so many different data sources; there's also a lot of integration with platforms like Kafka and Elasticsearch, which makes Postgres a fantastic base for that; and we heard about pgvector yesterday, which allows integration with a lot of AI tools by storing embeddings in the database — so it's just getting stronger and stronger in that area. I was talking to our Fujitsu Vietnam team over the last couple of days while I've been here: they've invested heavily in this area and have an open data platform built on top of PostgreSQL technology, so it's worth going down to the booth and having a look at that particular use case.

Overall, PostgreSQL has been very successful at actively shaping these technologies by providing a reliable database technology to build on — one that fits easily into the cloud services we see today, into DevOps architectures and into the broader data ecosystems that are emerging. So what does this mean — what's the key here for open source communities? Thinking about it, maybe we need to start looking at how to engage with potential partners — not so much for Postgres, because I think that's already happened, but perhaps exploring ways to make sure the open source technology we develop integrates with those commercial solutions, embedding ourselves within them. That could mean something like
developing plugins for popular commercial products, ensuring compatibility with widely used standards, or even collaborating on joint ventures that combine open source and commercial software. Embracing this ecosystem-oriented approach, whether commercial or open source, in the end unlocks new value for end users and customers and produces a sum much greater than the individual parts. Just to finish off: if we look at PostgreSQL in particular and how embedded it is in all of these landscapes, from the clouds to everything else, the PostgreSQL project is in a fantastic position for the future — I think it's going to be around for a long time. And for those of you leaving university: getting involved in any community is great, but the PostgreSQL community is a great one to join, learn from and be a member of before you go out and look at others, so I encourage you to get involved. I'm going to finish here — hopefully I've provoked a few thoughts around this area for you to go and play with, taking you up out of some of the lower-level technologies we've looked at and making you think about open source at a higher level, before we dive back down into the details in our next session. I'll hand over to Nishke, my colleague, to talk about logical replication. Thanks for listening, and enjoy the rest of the conference.

Alright, thank you. Hello everyone, welcome to this session. In this session I'm going to talk about logical replication in PostgreSQL 16 and 17. Before starting on the agenda, I'd like to introduce myself: my name is Nishke Kothari, I work as a technical consultant at Fujitsu Singapore, and I'm part of the Fujitsu Enterprise Postgres Global Center of Excellence team. This is the agenda of the session, and over the next 17 to 18 minutes I'm going to cover all of these points.

Introduction: I'm sure those already working with PostgreSQL are fully aware of what logical replication is, but I'd like to quickly cover the introduction for newcomers. Logical replication was introduced in PostgreSQL 10. It is a replication technology that replicates data changes from one server to another using the concept of a publisher and a subscriber. Data is replicated from publisher to subscriber based on each table's replica identity — generally the primary key, but sometimes, due to business requirements, a table is created without a primary key; in that case you can set REPLICA IDENTITY FULL for that particular table so it can be part of the replication as well. Logical replication can also be set up between two different versions of PostgreSQL, but both versions must be greater than or equal to 10; otherwise it cannot be configured, because the feature only exists from PostgreSQL 10 onwards.
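A minimal sketch of that publisher/subscriber setup; host names, database, user and table names are placeholders:

-- On the publisher (wal_level must be 'logical'):
CREATE PUBLICATION pub_orders FOR TABLE orders;

-- A table without a primary key can still be replicated via REPLICA IDENTITY FULL:
ALTER TABLE audit_log REPLICA IDENTITY FULL;

-- On the subscriber, with the same table definitions in place:
CREATE SUBSCRIPTION sub_orders
    CONNECTION 'host=server-a dbname=appdb user=repuser'
    PUBLICATION pub_orders;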
One publisher server can also replicate to multiple subscribers, and all the subscribers are capable of accepting write operations if you need to design your application in such a way. Data is replicated from publisher to subscriber using logical replication slots, and an additional replication slot is also created while performing the initial table data synchronization from publisher to subscriber; once that synchronization has completed successfully, all of those additionally created replication slots are deleted automatically. I must say that logical replication is not particularly resource intensive if we compare it with the traditional trigger-based replication called Slony.

PostgreSQL 16 was released last year, and I would say it is perfectly stable and perfectly usable for production environments as well. So over the next couple of slides I would like to quickly cover the newly introduced logical replication enhancements in PostgreSQL 16. I would like to start with PostgreSQL 9.5: in PostgreSQL 9.5 the concept of the replication origin was introduced for the logical decoding framework, and it allows applications to mark, label and identify certain aspects of a logical decoding session. Now, in the context of logical replication, let's say you have an architecture where server A is your publisher and server B is your subscriber; if we insert any number of rows on server A they are replicated to server B. That's the normal, ideal architecture. But now let's say you have an additional requirement and you would like to create some sort of bi-directional replication, where you create replication from server B back to server A as well, for the same table for which you already have logical replication from A to B. In this case, if we add one row on server A it goes to server B, and the same row then tries to replicate from server B back to server A, so it becomes an infinite replication loop among the replication nodes.

So how do we handle this kind of scenario? The community came up with a new subscription parameter called origin. With the help of origin it handles replication cycles among all the replication nodes, and it also instructs the publisher to send only those data changes to the subscriber which originated from SQL commands. That is the mechanism to break the infinite loop of replication between the replication nodes. This parameter has two values: one is any, which is the default value, and the other is none. If we create the subscription with origin = any, the subscriber requests that the publisher send all the data changes irrespective of the origin of that data, whether it was created by a replication command or by SQL commands. But once we set origin = none while creating the subscription, the subscriber requests only those data changes which were created by SQL commands. That's the benefit, and with it we can achieve some sort of bi-directional replication. There is definitely no DDL or sequence replication, and there is no conflict resolution logic either, but we can still achieve some sort of basic bi-directional replication.
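As a rough sketch of the two-node setup just described, with publication, subscription and host names as my own placeholders, the loop is broken by creating each subscription with origin = none; the copy_data setting shown is an assumption about how you would typically bootstrap such a pair, not something stated in the talk.

```sql
-- On server A: subscribe to B's publication, asking B to send only
-- changes that originated from local SQL, not replicated rows.
CREATE SUBSCRIPTION sub_b_to_a
    CONNECTION 'host=serverB dbname=appdb user=repuser'
    PUBLICATION pub_b
    WITH (origin = none, copy_data = off);

-- On server B: the mirror-image subscription back to A.
CREATE SUBSCRIPTION sub_a_to_b
    CONNECTION 'host=serverA dbname=appdb user=repuser'
    PUBLICATION pub_a
    WITH (origin = none, copy_data = off);
```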
Next is the large transaction handling mechanism. Up to PostgreSQL 15, large transactions were applied at the subscriber side only at commit time. The logic behind that is that the publisher sends the data changes in multiple streams by dividing the large transaction into multiple chunks, and at the subscriber side the apply worker reads all of those small chunks of changes and writes them into temporary files; once the subscriber receives the commit, the apply worker reads all the temporary files and applies the entire transaction at the subscriber side. This is time consuming and introduces a performance overhead. To handle this, the community came up with a new value for the subscription parameter called streaming: with its help we are able to apply large transactions in parallel. What happens is that when we create a subscription with streaming = parallel, a large in-progress transaction is assigned to a dedicated parallel apply worker, and that worker stays assigned until the transaction finishes successfully. We can control the number of such workers by modifying the parameter max_parallel_apply_workers_per_subscription, which was introduced in PostgreSQL 16 and is available in the PostgreSQL configuration file. Overall, with this particular enhancement, a 25 to 40% performance benefit has been observed.

Next is the use of indexes. If you remember, in the introduction slide I mentioned that tables are replicated based on the replica identity; generally that is the primary key, but if that is not possible we need to set the replica identity to FULL. Now, with this enhancement, the subscriber is able to use an index to find matching rows even for such tables, but the index must be a B-tree index, not any other kind: other index types, whether partial, hash, BRIN, GiST or GIN, are not supported at this moment, only B-tree.

Next is logical decoding support from a standby. When I say standby, I mean a streaming replication primary server and its standby server. Generally, I'm sure you are all aware that we cannot perform any DDL or DML operations on the standby server, and when we try, it throws an error. But from this PostgreSQL version onwards, if we create the primary and standby servers with the wal_level = logical parameter, the standby will allow logical replication to be created from it. With the help of this enhancement, when your primary server is fully loaded and you would like to distribute the workload, we can instruct all the subscriber servers to subscribe to the standby server instead of the primary server.
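A minimal sketch of these PostgreSQL 16 pieces, including decoding from a standby; the subscription name, worker count and host names are chosen purely for illustration.

```sql
-- Parallel apply of large in-progress transactions (PostgreSQL 16).
CREATE SUBSCRIPTION big_tx_sub
    CONNECTION 'host=standby.example.com dbname=appdb user=repuser'
    PUBLICATION big_tx_pub
    WITH (streaming = parallel);

-- postgresql.conf on the subscriber: cap the parallel apply workers.
-- max_parallel_apply_workers_per_subscription = 2

-- postgresql.conf on the primary (and standby), so that logical
-- decoding, and therefore subscriptions, can run off the standby:
-- wal_level = logical
```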
With subscribers pointed at the standby, we can completely offload that work from the primary server.

Next is allowing non-superusers to create subscriptions. Up to PostgreSQL 15, only a superuser was allowed to create a subscription, but with this enhancement a non-superuser is also able to create a subscription on the subscriber server. We need to make sure that the non-superuser who is going to create the subscription has the pg_create_subscription privilege, otherwise they will not be allowed to create it.

In the next two slides I'm going to talk about the catalog tables and views which help you monitor your logical replication. Let's start with pg_stat_replication: I'm sure all of you are aware that pg_stat_replication provides information about replication status, and yes, it is common to both logical replication and streaming replication, so there is no separate view or table for logical replication monitoring. pg_publication and pg_subscription are the system catalog tables responsible for providing information about publications and subscriptions. pg_publication_rel and pg_subscription_rel are the catalog tables that provide information about the mapping between your tables and their respective publications and subscriptions. Next is pg_replication_slots: this view provides information about all the replication slots that exist inside your current database cluster, along with their status. pg_publication_namespace was introduced in PostgreSQL 15 and provides information about the mapping between schemas and publications. pg_stat_subscription is a statistics view that provides information about the logical replication subscription workers. And the last one, pg_stat_subscription_stats, was also introduced in PostgreSQL 15 and provides information about logical replication subscription errors.

Now on to PostgreSQL 17. In this slide I'm going to talk about the logical replication enhancements which we are expecting to be part of PostgreSQL 17. It is not confirmed yet, because the release notes are still not available for general use, so these are expectations for PostgreSQL 17. Let's start with the first one: allowing the use of hash indexes. As you remember, for PostgreSQL 16 I covered that, besides the primary key, we can set the replica identity to FULL and use a B-tree index; from 17 onwards we should also be able to use hash indexes, which is an additional benefit for users who want to use logical replication on tables that do not have a primary key. Next is a new column, worker_type, added to pg_stat_subscription. As we saw on the last slide, pg_stat_subscription provides information about the subscription workers; with this new column we will be able to see what kind of operation a particular subscription process is performing, and the possible values are apply, parallel apply and table synchronization. Next is allowing the upgrade of logical replication nodes, the most interesting enhancement, I would say, that we are expecting in PostgreSQL 17.
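Before getting to that upgrade enhancement, here is a small sketch pulling together the non-superuser privilege and a couple of the monitoring views just mentioned; the role, database and object names are placeholders, and the additional GRANT on the database reflects my understanding of what CREATE SUBSCRIPTION requires rather than anything stated in the talk.

```sql
-- Let a non-superuser create subscriptions (PostgreSQL 16).
CREATE ROLE repl_admin LOGIN PASSWORD 'secret';
GRANT pg_create_subscription TO repl_admin;
GRANT CREATE ON DATABASE appdb TO repl_admin;  -- assumed prerequisite

-- Which tables belong to which publication?
SELECT p.pubname, c.relname
FROM pg_publication_rel pr
JOIN pg_publication p ON p.oid = pr.prpubid
JOIN pg_class c       ON c.oid = pr.prrelid;

-- Status of the replication slots in this cluster.
SELECT slot_name, slot_type, active FROM pg_replication_slots;
```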
Generally, what happens when we perform a major version upgrade of PostgreSQL in a logical replication setup using the pg_upgrade utility is that the replication slots are not migrated from the older version to the new version on the publisher side, and on the subscriber side the full state of the subscriptions is not preserved either. As a result, after a successful major version upgrade we need to recreate all the replication slots on the publisher server and also redefine the subscriptions on the subscriber side, which leads to the initial table data synchronization happening again. That is time consuming and a waste of effort. The community has enhanced this functionality, and with this change all the replication slots are migrated from the older version to the newer version, and the subscriber is also able to preserve the full state of its subscriptions, so there is no need to recreate the replication slots or to redefine your subscriptions. But to get this benefit we need to make sure that both publisher and subscriber are on PostgreSQL 17 or a later version.

Next is adding log messages about replication slots. Up to PostgreSQL 16, the replication slot details are recorded inside the pg_replslot/<slot name> directory, and all the information is stored in binary format, which is not human readable. So if our database server crashes for any reason, we are not able to identify how many replication slots were active or inactive at the time of the crash, and there is no alternative way to identify the metadata of the replication slots; we lose that information completely. Now a new log message entry is going to be added to the PostgreSQL log which provides statistics about the replication slots as well. The last two items are fairly self-explanatory, and I must say one of them is a long-awaited feature which we are hoping will be part of 17, but the code is still under review by the community members, so we are not sure whether it will make it in or not. So that's it from my side; if you have any questions, please feel free to ask.

One question: we all use extension modules. I guess you use projects like the adaptive query optimizer extension and so on, and we use the same; it has a lot of extensions under the hood. What do you do in the direction of replicating extension data, the internal state of extensions? Because right now we have a problem: we have an extension on prod and quite a different extension, with different settings, on the replica, and we need to synchronize the states. What do you do? Correct me if my understanding is incorrect: your question is about the data stored by specific extensions, which are created with the CREATE EXTENSION command, something like that? So again, logical replication does not handle any DDL, so you have to create the extensions on the publisher and on the subscriber manually. If you would like to synchronize the extension's data as well, and it keeps being updated at regular intervals, then you can create a publication and subscription over that data too; and if it is one-time, fixed data, it's just
like metadata, then you can also just restore it once on the other side.

I guess we need some general technique, some API for extensions, to allow them to replicate their data, because it's not uniform; it's not just a table, sometimes it's maybe a file. Maybe you are working in this direction, to provide some API? So again, this is purely database-level replication. If you are talking about files, files will definitely not be part of the replication, because it is purely logical: it is based on reading the logical WAL transactions. Whatever transactions are performed on the publisher are recorded in the WAL files, logical decoding reads them record by record and applies them on the subscriber machine. The file system level is definitely not taken care of by logical replication; it can be taken care of, I would say, by streaming replication.

Hi, thanks, great information, and I have a question: is it really possible to have logical replication within the same database? So maybe we have two tables within the same database; is it possible to replicate the data between them? Let me understand your question: table 1 is going to be replicated within the same database as, say, table 1.1, something like that? It is possible, but I just want to understand what the use case is. It's for a demo; they want to know whether it is really possible or not. Yes, it is possible; as I mentioned, you need to create the publisher and subscriber for that, but the tables cannot have the same name, you need to give them different names.

OK, last question. Sorry, my English is not very good. As far as I know, versions before 16 or 17 do not support features such as automatic failover or load balancing, so the question is: do community versions 16 and 17 have replication with high availability or failover, in terms of master failover? You cannot control the passive node if the active one keeps failing. Is there anything that can do that? For logical replication, which slide are you talking about, just show me, this slide? OK, so the thing is that with logical replication both servers are able to accept write operations: the publisher can perform reads and writes, and the subscriber can also perform reads and writes, but it is not bi-directional replication. So your application needs to point to the one server which is the publishing server, and all the transactions go there; in case it goes down for any reason, you can just switch your connection string and run all of your business from the subscriber server. There is no real requirement for failover from the database point of view, because both are active and both are able to read and write. It is active-active, but the application will point to only one server, because there is no bi-directional replication. So once one server is down, first of all we need to achieve business continuity, right? There is no concept of failover in logical replication, so it just needs to be taken care of through the connection string or something like that: if one server is not responding, the connection needs to be redirected to the other server, because the other server is already up and running in read-write mode, so you just need to switch the connection string. Yes, it's just a switchover, not of the database, only of the connection string.
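The connection-string switching the speaker describes can be done directly in libpq, which accepts multiple hosts and can require a writable node; a hedged sketch with placeholder host, database and user names:

```
# libpq keyword/value connection string that tries the publisher first
# and falls back to the subscriber, requiring a node that accepts writes.
host=pub.example.com,sub.example.com port=5432,5432 dbname=appdb user=appuser target_session_attrs=read-write

# Equivalent URI form:
postgresql://appuser@pub.example.com:5432,sub.example.com:5432/appdb?target_session_attrs=read-write
```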
Maybe the question is whether there is an open source proxy or failover tool that you can actually recommend? There is something called HAProxy, and people like it; there are multiple things you can do. You can also go with Pgpool-II, which will help you achieve those things as well. Apart from that, if you don't want to rely on a third party, you can rely directly on the connection string: with the help of libpq you can achieve those things too. And you can definitely visit the booth as well; we can answer all the questions there, so come back to the Fujitsu booth and you can have a discussion. Is that OK? OK, we move on with the next talk.

Yes, let's move on to the next talk before the lunch break. Can you pull that up? Hello, last talk before lunch. OK, that works now; check, one two; now it works. Yes, now, with security, we can continue. Hello guys, in this session I'm going to talk about advanced security strategies for PostgreSQL. I'm just going to skip my introduction because you're already aware of that. So this is the agenda of the session, and over the next 18 to 19 minutes I'm going to talk about all of these technical points.

I'm sure all of you have heard about data breaches, and as technology evolves day by day, the occurrence of data breaches is also increasing day by day. According to the latest data breach report by IBM and the Ponemon Institute, in 2023 the average cost of a data breach reached a record high of 4.45 million USD, which is almost 2.3% higher than the cost in 2022, when it was around 4.35 million USD. I must say that a data breach incident is one of the most embarrassing situations for any organization that becomes a victim of one. A data breach not only brings embarrassment for the organization, it also brings certain negative impacts, like damage to the brand reputation, reduced customer trust and revenue loss, and all of this also impacts the stock price, which is likely to fall. That's the reason it is, I would say, very bad for organizations in other aspects as well. You can see on the screen that various organizations and government entities have become victims of data breaches in recent times.

There are various security techniques which you can integrate with PostgreSQL to protect your environment and your database from certain security threats, like misuse of privileges, SQL injection, physical file theft, man-in-the-middle attacks and many other security threats as well. In this slide I'm going to talk about one of the most powerful database security tools, which provides a complete safeguard for your sensitive data, called transparent data encryption, or TDE. TDE acts like a dedicated security guard for your database, and it works around the clock: depending on the company requirement, it creates either a 128 or 256 bit encryption shield around your data files, so it provides complete data-at-rest security. In case attackers somehow manage to break into your database server and try to access those files, they will not be able to see the actual data; they will only see scrambled code, and that is the reason that TDE is one of the most widely accepted techniques among organizations that mainly have to deal with the PCI DSS standards.
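The talk continues below with how TDE is deployed. As a hedged sketch of what an encrypted tablespace can look like: this is the Fujitsu Enterprise Postgres flavour of TDE, and the option name tablespace_encryption_algorithm, the path and the table are my assumptions rather than syntax quoted in the talk, so treat this as an illustration and check the product documentation.

```sql
-- Assumed FEP-style syntax: a tablespace whose files are encrypted
-- with AES-256; objects created in it are then encrypted at rest.
CREATE TABLESPACE secure_ts
    LOCATION '/database/secure_ts'
    WITH (tablespace_encryption_algorithm = 'AES256');

-- Any table placed in the tablespace is transparently encrypted.
CREATE TABLE payments (
    card_number text,
    amount      numeric
) TABLESPACE secure_ts;
```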
So if you would like to deploy, or I would say configure, TDE in your environment, no extra or complicated steps are required, and there is no modification or rewriting of the existing application logic once you configure it. In case you have very strong security requirements and you would like to rotate your encryption key at regular intervals, you can also do that without any complication and without any need to re-encrypt your existing data with the newly rotated encryption key. And I must say that TDE works very seamlessly with your application, because no performance degradation is observed when you configure TDE; everything works as it is.

TDE encrypts everything which comes within its scope. Let's start with the tablespace: if you create a TDE-encrypted tablespace in your environment, any database object created inside that tablespace is automatically encrypted. Backup files and temporary files: TDE encrypts all the backup files as well as the temporary files, so hackers will not get any benefit from accessing those files. WAL files: according to the internal architecture of PostgreSQL, all successful transactions are recorded inside the WAL files to make them available for instance crash recovery, and nowadays a utility is available which can dump your WAL files into a human-readable format; TDE prevents that by encrypting every single WAL file, so hackers will not be able to dump those WAL files into a human-readable form and your data stays secure.

In this slide I'm going to talk about another security tool which provides a complete safeguard for your sensitive data against prying eyes, called data masking. It is just like putting a mask on your sensitive data. To implement data masking in your environment, you need to identify the columns which store sensitive data in your database; once you have identified them, you can apply a technique such as encrypting, replacing or randomizing, which scrambles your original sensitive data into a non-identifiable form. Once you apply any of these techniques, your sensitive data is scrambled, so it is completely suitable for activities like testing, development or analytics, but if hackers get access to it, it is useless to them, because the original data is scrambled, so they gain nothing. This technique is also useful for providing protection in both online and offline modes. As I mentioned, data masking is just like putting a mask on your sensitive data: just as when we wear a mask, others can see our eyes and hear our voice, but they cannot see our full face. Data masking works on a similar concept, and it also comes in a few masking types. Let's start with full masking: full masking is like putting a superhero mask on your sensitive data, so it transforms every single piece of sensitive information in that column into a completely new scrambled form, just like turning Peter Parker into Spider-Man everywhere he appears.
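Fujitsu Enterprise Postgres ships its own masking feature, whose syntax I am not reproducing here; purely as a plain-SQL illustration of the full and partial masking ideas, with an invented table and columns, the effect is roughly this:

```sql
-- Generic illustration only: a masked view over a sensitive table.
CREATE VIEW customers_masked AS
SELECT
    id,
    '********'                               AS full_name,   -- full masking: value replaced entirely
    repeat('*', 12) || right(card_number, 4) AS card_number  -- partial masking: keep only last 4 digits
FROM customers;
```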
Partial masking is a technique to hide a certain part of your sensitive information while the rest of it remains visible. I'm sure all of you have seen this kind of partial masking in your day-to-day life: for example, when you shop online and you have already saved your bank card as a preferred payment method, you will have observed that only the last 4 digits of your saved card are visible and the initial 12 digits are masked; that's partial masking. Next is regular expression masking: it is a technique to mask only the information that matches a particular pattern, while keeping the rest of the information in its well-organized form. For example, if you would like to display an address, the building number and building name are visible to all users, but the street name, and only the street name, is masked for users who fall under your masking policy, while it remains visible to the users who do not. That's regular expression masking; you can see an example at the bottom of the slide.

Auditing. Audit logging enables the organization to trace and audit the usage of sensitive data, and it also logs connection attempts to the database, as well as providing additional information about data access by logging things like what data was accessed, who accessed it, when it was accessed and how it was accessed. All of this information can be logged in your audit log files. Depending on the business requirement, the administrator can configure the events or database objects for which they would like to collect audit information. All the audit information is logged in dedicated audit log files, which are completely separate from the postgresql.log files, for ease of access and management. There are two types of audit logging: one is session audit logging and the other is object audit logging, and both types can be configured by the administrator of the organization. Session audit logging logs events such as PostgreSQL service starts, connection attempts, backup operations, maintenance operations and a few others, while object audit logging logs operations such as all the DML operations performed on the specified database objects, so object audit logging provides a more granular level of audit information. The actual goal of auditing is to give the organization the capability to generate database logs that comply with various government and financial standards such as PCI DSS, and it also helps the organization obtain ISO certification if required. You can see on the screen the list of class names which need to be configured inside the audit configuration files; based on the class name configuration, the appropriate statements are logged inside the audit log files.
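The speaker confirms later in the Q&A that this audit logging is based on pgaudit as integrated into Fujitsu Enterprise Postgres. As a community-pgaudit flavoured sketch only, with illustrative class values and role/table names (the FEP audit configuration file format differs), session and object audit logging look roughly like this:

```sql
-- Session audit logging: record DDL and role/privilege commands.
-- (settings normally placed in postgresql.conf / the audit config)
-- shared_preload_libraries = 'pgaudit'
-- pgaudit.log = 'ddl, role'

-- Object audit logging: audit DML on a specific table by granting
-- the privileges to be audited to a dedicated auditor role.
CREATE ROLE auditor NOLOGIN;
-- pgaudit.role = 'auditor'
GRANT SELECT, UPDATE, DELETE ON payments TO auditor;
```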
In this slide I'm going to talk about one of the most powerful PostgreSQL extensions, called pgcrypto, which provides a complete safeguard for your sensitive data within the database itself. As I mentioned, pgcrypto is very powerful: it offers various cryptographic operations, it allows you to encrypt and decrypt your sensitive data, and it also gives users the facility to generate various hashes and digital signatures. If I talk about the capabilities of this extension: with the help of robust symmetric and asymmetric encryption algorithms, users can encrypt and decrypt their sensitive data, and the extension supports a wide range of algorithms, for example AES and Blowfish, and RSA keys via the OpenPGP standard. With that, your data is encrypted and secure, and it cannot be decrypted by an attacker even if they manage to break into your system, because they would need the appropriate decryption key. Next, users can also secure their critical passwords: pgcrypto has wide support for hashing algorithms like SHA-256, SHA-1 and MD5, and with their help it ensures that a stored password is irreversible and cannot simply be read back, even if it is retrieved directly from the database. The last but most important feature of this extension is that it allows users to create digital signatures, which work as a seal to verify the authenticity and integrity of your data.

In this slide I'm going to talk about how you can secure your data when it is moving, meaning data in transit. PostgreSQL has an inbuilt capability called SSL, the secure sockets layer, and it works together with transport layer security, TLS. With the help of both we can make sure that our data is always secure while it is travelling between your computer and the database server. With SSL/TLS we get two types of encryption: one is public key encryption, which acts like a fancy lock, so only the right party, meaning you and the database, can read what is being sent over the network; symmetric encryption is like a secret code, so it looks like garbage to anyone without the appropriate key. If you want to configure SSL/TLS in your environment, certain things need to be in place: OpenSSL needs to be installed on both client and server, a server key and certificate are required to make sure that you are talking to the right place and not to some impostor, and certain SSL-related configuration also needs to be done at the PostgreSQL level. Once you have all these things in place, you are good to go and able to secure your data when it travels over the network. PostgreSQL supports both self-signed and third-party CA certificates, so depending on the business and security requirements you can choose either one.

Security patching. Security patching helps you protect your system and keep it healthy against potential vulnerabilities and security threats. We should not ignore any security patching, because it provides a safeguard for your environment against security vulnerabilities which could be exploited by a hacker to break into the system and damage or steal your valuable data. It is always crucial to apply all the OS and database security patches at regular intervals, as soon as they become available for use.
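Circling back to pgcrypto for a moment, here is a minimal sketch of the encryption, password hashing and digest capabilities described above, using standard pgcrypto functions; the sample value and passphrase are placeholders.

```sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Symmetric (OpenPGP) encryption and decryption of a sensitive value.
SELECT pgp_sym_decrypt(ciphertext, 'my-secret-key') AS plaintext
FROM (
    SELECT pgp_sym_encrypt('4111-1111-1111-1111', 'my-secret-key') AS ciphertext
) t;

-- One-way password hashing with a per-password salt (irreversible).
SELECT crypt('user-password', gen_salt('bf'));

-- Plain digest, e.g. the SHA-256 hash of some data.
SELECT encode(digest('some data', 'sha256'), 'hex');
```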
It is always beneficial to integrate your patching process into your development process as well, so that testing and implementation of a security patch are completed successfully before it is applied to the live production system. The last point is that it is highly advisable to automate your patching process: if you apply patches in your environment with the help of automation you get certain benefits, like saving time, bringing consistency across all your environments and reducing the risk of human error. Those are the benefits of applying patches via automation. So that's it from my end; if you have any questions, please go ahead.

The auditing capability, object and session level, is that the pgaudit extension? It's the same, it's pgaudit, but it's integrated with Fujitsu Enterprise Postgres, so it's part of that. All the security features you have seen on screen, the TDE, the data masking, the SSL configuration, are all available inside Fujitsu Enterprise Postgres, and if you would like to know more about the compliance level of your database, or I would say its security posture based on the CIS benchmark, you can visit our booth and we will walk you through that; we already have a security assessment service for it.

Can I ask a question? OK, quick question: do you use these features? We already have them built into Fujitsu Enterprise Postgres. We have a question over here. This is very interesting, thank you. My question is how fast it is compared to running with no encryption. OK, we already have test results for that, but this is not the right platform to go through the comparison; if you would like to know, you can arrange a technical expert call with us and we will walk you through it. This isn't the platform to discuss performance differences between one server and another. No problem. My next question is: do you use hardware-accelerated encryption? Sorry, hardware-accelerated encryption, can you elaborate on your question? The question is about using the processor's instructions for it. OK, again that is related to the benchmarking tests, and as I mentioned this isn't the platform for that; you can book a call with us and we will walk you through all those things, and at the booth as well. OK, thank you.

We can have one more question. You were mentioning security, encryption, masking and auditing; I am testing masking with PostgreSQL Anonymizer. No, we have our own functionality, our own extensions for masking; we are not relying on the community one for this. We are also interested in embedded encryption, something like that, and also in combining it with the regulations; does the masking itself also tie in to the regulations? Yes, Fujitsu Enterprise Postgres is fully compliant with the various data security standards like HIPAA, GDPR and PCI DSS. Thank you very much. We should also let you know that the new version of Fujitsu Enterprise Postgres is compliant with FIPS 140-2.

OK, I think we have one more question. I am not sure I understand TDE correctly, but it encrypts data going from shared buffers onto disk; what do we do with some specific places, like statistics, where we store interesting things such as the most common values of your data, or some caches? For example, data is sometimes spilled to disk when sorting or joining, or tuplestores, which can also be spilled to disk; do you secure this data or
not? So TDE is purely for data-at-rest security; it works at the file system level. OK, thank you very much, let's have a round of applause for the speaker.