Informatica性能调优(初级)

更新时间:2023-07-17 15:34:01 阅读量: 实用文档 文档下载

说明:文章内容仅供预览,部分内容可能不全。下载后的文档,内容与下面显示的完全一致。下载之前请确认下面内容是否您想要的,是否完整无缺。

Informatica性能调优(初级)

INFORMATICA调优要点(初级)

原作者:Dan Linstedt

翻译:内森(informatica@)

创建时间:2004-11-20

最后修改时间:2004-11-20

相关文档链接:

Informatica性能调优(初级)

说明

z 本文在不经过修改的基础上可以自由发布,但不得用于商业用途,

译者保留中文翻译版权

z 本着理解至上的原则,本文对于难于直译的部分,采用意译的方

法,请读者注意,如有必要,请参阅后面的原文

z 由于译者能力所限,对原文部分语句理解可能存在误解(尤其是

红色所标识部分),欢迎指正。对于有重要贡献的读者,将在文档后的鸣谢中列出,以示感谢。

Informatica性能调优(初级)

译文

本文所列的诸要点都是对INFORMATICA相关产品进行调优过程中所涉及到较宏观的问题。他们不是放之四海而皆准的教条,也不是最终的解决方案。其中一些已经条目(尤其是已经进行过调优)的建议可能结果会有所不同。适用于某些特定条目的技巧由于其所要解决问题的层次不同也会产生不同的调优结果。

对性能进行测试的时候,建议使用20万条记录左右的数据源进行处理。使用比之更大数据量的测试数据源,可能会产生因为表的分区、删除和重建索引、RAID数据条带化等问题数据库相关问题导致性能下降的问题,而如果使用的数据源的集合太小,统计出来的平均处理时间可能会因为数据库吞吐量、主机负荷以及网络流量等因素的影响而变得不稳定。20万条记录的集合一般是进行准确统计的比较理想的测试数据源。

首先试着用如下的方法对MAPPING进行调优,然后进行SESSION的优化,然后重复这一过程直到对调优的结果满意为止,或者无论怎么努力也无法取得更好的效果。如果经过调优,性能仍然无法令人满意,那么整个处理的体系结构就需要进行调整(或者说改变MAPPING所要完成的工作)。如果是这样,你可以联系我们(译者注:指原作者),我们可以对体系结构以及整个系统自底向上的进行调优。

始终要记住:要想得到一个理想的运行性能,尽量的要使得系统中的各个部分到达一种运行的平衡状态,包括数据库、硬件资源等,让他们做各自擅长的事情。不同的体系结构可能在速度以及优化的可能性等方面产生巨大的差异。

1、利用数据库(例如ORACLE/SYBASE/INFORMIX/DB2)进行大量的数据处理操作(例如排序、分组、汇总等)。换句话说,临时表(staging tables)会对并行操作有很大的益处。在并行设计中,那些简单的数学计算总是会减少你的执行时间。临时表有很多的优点,详情请参考关于方法的讨论。

2、尽可能的本地化。将所有的目标表放到ORACLE的同一个实例中(相同的SID),SYBASE也是一样。对任何处理都不要使用同义词(远程数据库连接),包括LOOKUP、存储过程、目标表、数据源、函数、权限等等。远程连接的使用会使得处理变得很慢。对于SYBASE的用户,数据库的远程使用会使性能产生很显著的降低。

3、尽可能的使目标表、存储过程、函数、视图以及序列都放到数据源的本地。同样,不要通过同义词进行连接。同义词(远程数据表)可能会使性能降低3倍甚至更多。

4、减少外部定义的模块。作为替代,在pre-processing/post-processing中使用PERL、SED、AWK、GREP等功能。调用外部的应用编程接口(API)本身就会降低性能(比如1/1/2000)。幸好将来INFORMATICA在这方面会有所改善。能展现性能问题的外部模块是正则表达式模块(regular expression module),环境是UNIX系统:Sun Solaris E450, 4 CPU's 2 GIGS RAM, Oracle 8i and INFORMATICA,不使用该模块的情况下,每秒可达1500行,而使用该模块速度变为486行/秒,测试的时候没有其他的SESSION在运行。(这是一个特定的测试用例,不代表所有的MAPPING运行效果)。

5、时刻谨记INFORMATICA建议每个SESSION使用1~1.5个CPU。在这种情况下,INFORMATICA可以和关系数据库引擎在同一台机器上配合的很好,但是从性能优化的角度来讲,INFORMATICA与其他的引擎(包括报表引擎、JAVA引擎、OLAP引擎、JAVA虚拟机等)就配合的不是很好。

6、减少基于数据库的序列。因为基于数据库的序列生成器需要一个wrapper的函数和过程调用。使用这样的过程会对使得性能降低3倍左右。而且这样的速度降低也不容易通过

Informatica性能调优(初级)

debug来发现,它仅仅是在写入数据库的列时才引起速度的降低。先复制这个MAPPING,然后改调用上面提到的存储过程为调用INFORMATICA本身提供的内部序列发生器,这样的运行结果是这个MAPPING所能得到的最快的运行结果。如果你必须使用基于数据库的序列发生器,那么最好遵循临时表的使用建议。如果你处理的是GB级或者是TB级的数据量,这样会节省很长的调试时间。如果你必须使用一个共享的序列发生器,那么根据平面文件(flat file)建立一个临时表,加入一个SEQUENCE ID列,然后调用一个属性为POST TARGET LOAD的存储过程来生成那个列。把这个属性为POST TARGET LOAD的存储过程放到那个从平面文件导入临时表的MAPPING中去(译者注:这一句话与前面的一句所表达的意思相同)。数据库里面存储过程的一次调用,紧接着是一个分配序列的批量操作,是处理这用共享的序列发生器的最近的方法。

7、关闭详细日志。SESSION的日志会对整个MAPPING的性能产生极大的影响。在SESSION里面去掉“覆盖”(over-ride),将生成日志的属性设置为正常日志模式。不幸的是,在INFORMATICA的内部,日志记录并不是一个并行的机制,而是直接安排在操作的进行过程中。

8、关闭“收集性能统计”开关。如果开启这项功能也会对性能产生影响——虽然有的时候这种影响很小——因为它会把一系列的与性能相关的数据写入到性能日志中去。关闭这项功能会减少操作对与平面文件操作的依赖性。然而,在进行调优的过程中,开启这项功能又是非常必要的,它能发现一些reader和writer线程在速度方面的一些问题。

9、如果你的数据源是平面文件,使用临时表(参见本网站的临时表的相关幻灯片)。这样的话,你就可以使用SQL*Loader、BCP或者其他的数据库的并行装载的功能。只把那些简单的处理逻辑放置在数据源的加载MAPPING中,不要加入那些编码的查询和转换逻辑。如果在这种情况下,你的reader仍然比较慢,请从如下两方面进行检查:1)如果在你的产品注册信息或者配置文件中设置了ThrottleReader参数来限定最大的读取数据块的话,就会限制读取的速度(仅仅是在SESSION处理带约束的装载事务时明显存在问题的情况下,才进行这样的参数调整);2)把平面文件移动到本地磁盘上。尽量不要从网络或者是磁盘阵列上读取平面文件。大多数的磁盘阵列处理速度是很快的,但是INFORMATICA是个例外,而读取本地磁盘就非常的快。需要指出的是,连接(LINK)并不能提高速度,必须是将文件本身存储在本地磁盘上

10、尽量不使用无缓冲LOOKUP。使用无缓冲LOOKUP时,性能会受到显著的影响,尤其是如果LOOKUP的表是一个可增长或者是可更新的表,一般来讲这样的表在整个操作过程中它的索引是会发生变化的,因此优化器就无法利用索引的统计信息。同时,尽可能使用临时表,此时数据库中的视图可以将相关的数据关联起来,或者可以利用INFROMATICA的JOINER对象来关联数据,这两种都可以明显的提高数据处理速度。

11、分离复杂的MAPPING。试着将整个MAPPING分成一个个逻辑处理单元。如果需要进行并行的处理,重新进行体系结构的设计和布局。通过小的组件来处理单个的任务,可以提高整个处理过程的并行度,相关的细节请参见关于方法的讨论。

12、平衡,在INFORMATICA、SQL语句和数据库之间取得一种平衡。要充分利用数据库的特长:读、写、排序、分组、过滤,利用INFORMATICA来处理复杂的逻辑:外关联、数据继承、多数据源处理等等,这种平衡需要DBA的帮助来实现。为了达到这种平衡,需要根据各自的优势重新组织整个数据处理过程,利用数据库的处理能力并不是降低ETL工具的作用,相反是ETL工具的处理能力的加强,并且是进行大数据处理过程调优的必备条件。

13、调优数据库。要考虑不同数据集合的大小(包括小规模数据量、中等规模数据量、大量数据以及超大规模数据)对数据加载的时间的影响,并将这些信息提供给DBA,请求

Informatica性能调优(初级)

DBA对最坏情况下的数据库进行调优。帮助DBA估计哪些表会被进行大量的读写、什么处理会进行排序等占用数据库资源的操作,然后考虑将那些表放置到合适的物理磁盘上,这样会取得很好的效果。利用PERL脚本生成“假数据”来生成各种容量的测试集合,以此来检验MAPPING的加载性能,DBA就会根据这些情况来进行数据库的相关参数的调整。

14、确保在PMSERVER的机器上有足够的SWAP交换空间和临时空间。如果没有足够的磁盘空间,会导致处理性能成指数级的速度降低。因此可能需要在SESSION运行的时候监控磁盘空间,否则无法得到在操作过程中磁盘空间的变化情况,在MAPPING中含有AGGEGATOR、有缓冲的LOOKUP或者含有不同数据源的数据关联的操作的情况,更是有必要这样做。

15、在开发的过程中,在服务器上运行一些加载监控工具,以便更清楚相关的资源是如何被使用的,什么是热点资源。要遵循这些建议,因为有可能需要升级硬件设备来达到预期的处理能力。虽然价格比较昂贵,还是建议考虑EMC的磁盘存储阵列,因为它的处理速度相当的快,我听说(并没有确认)在某些情况下可以把性能提高50%左右。

16、设置SESSION。SESSION的属性中有很多可以用来调优。通过设置“Collect Performance Statistics”属性可以获得在MAPPING运行期间的一些性能方面的信息,从而可以对SESSION的其他属性进行修改,或者对数据库的参数进行调整,最终实现平衡的目的。仔细的阅读INFORMATICA联机文档中的调优手册。需要实现的基本目标是:读优化、处理优化、写优化。这三部分的过度优化可能最终导致SESSION的运行速度下降。例如,写速度受制于读的速度以及INFORMATICA的处理速度,反过来,读的速度又受制于INFORMATICA的处理速度和写的速度。调优一个有问题的MAPPING的最好方法是把他分成几个部分分别进行测试:1)读的处理,调优READER,检查相关的配置是什么,把读出来的数据输出到平面文件,以减少冲突。检查“ThrottleReader”参数(默认是不被设置的),用64K/slot的因子来提高默认的Buffer Size,不要考虑最高128K的警告。如果READER在SESSION运行过程中仍然是在开始的时候增长、但是在几千条数据后速度就稳定下来的情况,那么把Shared Session Memory从12MB提高到24MB,如果READER的速度仍然稳定不增长,那么MAPPING面对的就是一个比较慢的数据源、比较慢的LOOKUP或者缓冲区不是在本地磁盘。如果READER越过了它一直稳定的那个速度,记录下来此时的SESSION的设置。检查性能统计数据确保此时的WRITER不是瓶颈,因为此时是进行READER的性能调优,肯定不希望WRITER进程降低了READER的速度。然后将目标从平面文件改回到数据库,再次运行SESSION,记录下来READER的速度降低了多少,最优化的性能是在向平面文件中写入时的读取速度。这时,较慢的目标就是问题所在。注意:如果在向平面文件中写入的时候READER都不是很快,就要做一些基本的MAPPING调优工作了。尽量的合并EXPRESSION控件,设置LOOKUP为非连接的(如果可能就复用该LOOKUP),检查AGGREGATION和LOOKUP中的索引和数据缓冲区的大小。如果目标的写入速度比较慢,把MAPPING修改为一次写入一个目标表,然后可以找到引起写入速度慢的那个目标表,然后进行调优处理。对原始的MAPPING进行复制,然后将复制后的MAPPING进行分解。一旦发现了写入速度慢的目标表,请求DBA对表进行分区、更新表的统计数据、在数据加载过程中删除索引等等,有许多关于数据库的事情可以做,从而达到调优的目的。

17、将PMSERVER的机器上的所有其他应用都移走,处理数据库、临时数据库和数据仓库本身。PMSERVER可以和关系型数据库配合的很好,但是和其他的应用服务器配合的就很差,尤其是JAVA虚拟机、WEB SERVER、Security Servers、application SERVER以及Report servers。所有的这些应用都应该移到其他的机器,这一点对提高PMSERVER的机器的性能是非常关键的

Informatica性能调优(初级)

原文

INFORMATICA BASIC TUNING GUIDELINES

Dan Linstedt

The following points are high-level issues on where to go to perform "tuning" in Informatica's products. These are NOT permanent instructions, nor are they the end-all solution. Just some items (which if tuned first) might make a difference. The level of skill available for certain items will cause the results to vary.

To 'test' performance throughput it is generally recommended that the source set of data produce about 200,000 rows to process. Beyond this - the performance problems / issues may lie in the database - partitioning tables, dropping / re-creating indexes, striping raid arrays, etc... Without such a large set of results to deal with, you're average timings will be skewed by other users on the database, processes on the server, or network traffic. This seems to be an ideal test size set for producing mostly accurate averages.

Try tuning your maps with these steps first. Then move to tuning the session, iterate this sequence until you are happy, or cannot achieve better performance by continued efforts. If the performance is still not acceptable,. then the architecture must be tuned (which can mean changes to what maps are created). In this case, you can contact us - we tune the architecture and the whole system from top to bottom.

KEEP THIS IN MIND: In order to achieve optimal performance, it's always a good idea to strike a balance between the tools, the database, and the hardware resources. Allow each to do what they do best. Varying the architecture can make a huge difference in speed and optimization possibilities.

1. Utilize a database (like Oracle / Sybase / Informix / DB2 etc...) for significant data handling operations (such as sorts, groups, aggregates). In other words, staging tables can be a huge benefit to parallelism of operations. In parallel design - simply defined by mathematics, nearly always cuts your execution time. Staging tables have many benefits. Please see the staging table discussion in the methodologies section for full details.

2. Localize. Localize all target tables on to the SAME instance of Oracle (same SID), or same instance of Sybase. Try not to use Synonyms (remote database links) for anything (including: lookups, stored procedures, target tables, sources, functions, privileges, etc...). Utilizing remote links will most certainly slow things down. For Sybase users, remote mounting of databases can definitely be a hindrance to performance.

3. If you can - localize all target tables, stored procedures, functions, views, sequences in the SOURCE database. Again, try not to connect across synonyms. Synonyms (remote database tables) could potentially affect performance by as much as a factor of 3 times or more.

4. Remove external registered modules. Perform pre-processing / post-processing utilizing PERL, SED, AWK, GREP instead. The Application Programmers Interface (API) which calls externals is

Informatica性能调优(初级)

inherently slow (as of: 1/1/2000). Hopefully Informatica will speed this up in the future. The external module which exhibits speed problems is the regular expression module (Unix: Sun Solaris E450, 4 CPU's 2 GIGS RAM, Oracle 8i and Informatica). It broke speed from 1500+ rows per second without the module - to 486 rows per second with the module. No other sessions were running. (This was a SPECIFIC case - with a SPECIFIC map - it's not like this for all maps).

5. Remember that Informatica suggests that each session takes roughly 1 to 1 1/2 CPU's. In keeping with this - Informatica play's well with RDBMS engines on the same machine, but does NOT get along (performance wise) with ANY other engine (reporting engine, java engine, OLAP engine, java virtual machine, etc...)

6. Remove any database based sequence generators. This requires a wrapper function / stored procedure call. Utilizing these stored procedures has caused performance to drop by a factor of 3 times. This slowness is not easily debugged - it can only be spotted in the Write Throughput column. Copy the map, replace the stored proc call with an internal sequence generator for a test run - this is how fast you COULD run your map. If you must use a database generated sequence number, then follow the instructions for the staging table usage. If you're dealing with GIG's or Terabytes of information - this should save you lot's of hours tuning. IF YOU MUST - have a shared sequence generator, then build a staging table from the flat file, add a SEQUENCE ID column, and call a POST TARGET LOAD stored procedure to populate that column. Place the post target load procedure in to the flat file to staging table load map. A single call to inside the database, followed by a batch operation to assign sequences is the fastest method for utilizing shared sequence generators.

7. TURN OFF VERBOSE LOGGING. The session log has a tremendous impact on the overall performance of the map. Force over-ride in the session, setting it to NORMAL logging mode. Unfortunately the logging mechanism is not "parallel" in the internal core, it is embedded directly in to the operations.

8. Turn off 'collect performance statistics'. This also has an impact - although minimal at times - it writes a series of performance data to the performance log. Removing this operation reduces reliance on the flat file operations. However, it may be necessary to have this turned on DURING your tuning exercise. It can reveal a lot about the speed of the reader, and writer threads.

9. If your source is a flat file - utilize a staging table (see the staging table slides in the presentations section of this web site). This way - you can also use SQL*Loader, BCP, or some other database Bulk-Load utility. Place basic logic in the source load map, remove all potential lookups from the code. At this point - if your reader is slow, then check two things: 1) if you have an item in your registry or configuration file which sets the "ThrottleReader" to a specific maximum number of blocks, it will limit your read throughput (this only needs to be set if the sessions have a demonstrated problems with constraint based loads) 2) Move the flat file to local internal disk (if at all possible). Try not to read a file across the network, or from a RAID device. Most RAID array's are fast, but Informatica seems to top out, where internal disk continues to be much faster. Here - a link will NOT work to increase speed - it must be the full file itself - stored locally.

10. Try to eliminate the use of non-cached lookups. By issuing a non-cached lookup, you're performance will be impacted significantly. Particularly if the lookup table is also a "growing" or "updated" target table - this generally means the indexes are changing during operation, and the

Informatica性能调优(初级)

optimizer looses track of the index statistics. Again - utilize staging tables if possible. In utilizing staging tables, views in the database can be built which join the data together; or Informatica's joiner object can be used to join data together - either one will help dramatically increase speed.

11. Separate complex maps - try to break the maps out in to logical threaded sections of processing. Re-arrange the architecture if necessary to allow for parallel processing. There may be more smaller components doing individual tasks, however the throughput will be proportionate to the degree of parallelism that is applied. A discussion on HOW to perform this task is posted on the methodologies page, please see this discussion for further details.

12. BALANCE. Balance between Informatica and the power of SQL and the database. Try to utilize the DBMS for what it was built for: reading/writing/sorting/grouping/filtering data en-masse. Use Informatica for the more complex logic, outside joins, data integration, multiple source feeds, etc... The balancing act is difficult without DBA knowledge. In order to achieve a balance, you must be able to recognize what operations are best in the database, and which ones are best in Informatica. This does not degrade from the use of the ETL tool, rather it enhances it - it's a MUST if you are performance tuning for high-volume throughput.

13. TUNE the DATABASE. Don't be afraid to estimate: small, medium, large, and extra large source data set sizes (in terms of: numbers of rows, average number of bytes per row), expected throughput for each, turnaround time for load, is it a trickle feed? Give this information to your DBA's and ask them to tune the database for "wost case". Help them assess which tables are expected to be high read/high write, which operations will sort, (order by), etc... Moving disks, assigning the right table to the right disk space could make all the difference. Utilize a PERL script to generate "fake" data for small, medium, large, and extra large data sets. Run each of these through your mappings - in this manner, the DBA can watch or monitor throughput as a real load size occurs.

14. Be sure there is enough SWAP, and TEMP space on your PMSERVER machine. Not having enough disk space could potentially slow down your entire server during processing (in an exponential fashion). Sometimes this means watching the disk space as while your session runs. Otherwise you may not get a good picture of the space available during operation. Particularly if your maps contain aggregates, or lookups that flow to disk Cache directory - or if you have a JOINER object with heterogeneous sources.

15. Place some good server load monitoring tools on your PMServer in development - watch it closely to understand how the resources are being utilized, and where the hot spots are. Try to follow the recommendations - it may mean upgrading the hardware to achieve throughput. Look in to EMC's disk storage array - while expensive, it appears to be extremely fast, I've heard (but not verified) that it has improved performance in some cases by up to 50%

16. SESSION SETTINGS. In the session, there is only so much tuning you can do. Balancing the throughput is important - by turning on "Collect Performance Statistics" you can get a good feel for what needs to be set in the session - or what needs to be changed in the database. Read the performance section carefully in the Informatica manuals. Basically what you should try to achieve is: OPTIMAL READ, OPTIMIAL THROUGHPUT, OPTIMAL WRITE. Over-tuning one of these three pieces can result in ultimately slowing down your session. For example: your write throughput is governed by your read and transformation speed, likewise, your read throughput is governed by your transformation and write speed. The best method to tune a problematic map, is to break it in to components for testing: 1) Read Throughput, tune for the

Informatica性能调优(初级)

reader, see what the settings are, send the write output to a flat file for less contention - Check the "ThrottleReader" setting (which is not configured by default), increase the Default Buffer Size by a factor of 64k each shot - ignore the warning above 128k. If the Reader still appears to increase during the session, then stabilize (after a few thousand rows), then try increasing the Shared Session Memory from 12MB to 24MB. If the reader still stabilizes, then you have a slow source, slow lookups, or your CACHE directory is not on internal disk. If the reader's throughput continues to climb above where it stabilized, make note of the session settings. Check the Performance Statistics to make sure the writer throughput is NOT the bottleneck - you are attempting to tune the reader here, and don't want the writer threads to slow you down. Change the map target back to the database targets - run the session again. This time, make note of how much the reader slows down, it's optimal performance was reached with a flat file(s). This time - slow targets are the cause. NOTE: if your reader session to flat file just doesn't ever "get fast", then you've got some basic map tuning to do. Try to merge expression objects, set your lookups to unconnected (for re-use if possible), check your Index and Data cache settings if you have aggregation, or lookups being performed. Etc... If you have a slow writer, change the map to a single target table at a time - see which target is causing the "slowness" and tune it. Make copies of the original map, and break down the copies. Once the "slower" of the N targets is discovered, talk to your DBA about partitioning the table, updating statistics, removing indexes during load, etc... There are many database things you can do here.

17. Remove all other "applications" on the PMServer. Except for the database / staging database or Data Warehouse itself. PMServer plays well with RDBMS (relational database management system) - but doesn't play well with application servers, particularly JAVA Virtual Machines, Web Servers, Security Servers, application, and Report servers. All of these items should be broken out to other machines. This is critical to improving performance on the PMServer machine.

Informatica性能调优(初级)

鸣谢

本文来源:https://www.bwwdw.com/article/c7a1.html

Top